* topics for the file system mini-summit
@ 2006-05-25 21:44 Ric Wheeler
From: Ric Wheeler @ 2006-05-25 21:44 UTC (permalink / raw)
To: linux-fsdevel
Now that the IO mini-summit has solved all known issues under file
systems, I thought that I should throw out a list of challenges/problems
that I see with linux file systems (specifically with ext3 & reiserfs) ;-)
With the background of very large (and increasingly larger) commodity
drives, we are seeing single consumer drives of 500GB today with even
larger drives coming soon. Of course, array based storage capacities
are a large multiple of these (many terabytes per LUN).
With both ext3 and with reiserfs, running a single large file system
translates into several practical limitations before we even hit the
existing size limitations:
(1) repair/fsck time can take hours or even days depending on the
health of the file system and its underlying disk as well as the number
of files. This does not work well for large servers and is a disaster
for "appliances" that need to run these commands buried deep in some
data center without a person watching...
(2) most file system performance testing is done on "pristine" file
systems with very few files. Performance over time, especially with
very high file counts, degrades very noticeably on very large file
systems.
(3) very poor fault containment for these very large devices - it
would be great to be able to ride through a failure of a segment of the
underlying storage without taking down the whole file system.
The obvious alternative to this is to break up these big disks into
multiple small file systems, but there again we hit several issues.
As an example, in one of the boxes that I work with we have 4 drives,
each 500GB, with limited memory and CPU resources. To address the
issues above, we break each drive into 100GB chunks which gives us 20
(reiserfs) file systems per box. The set of new problems that arise
from this include:
(1) no forced unmount - one file system goes down, you have to
reboot the box to recover.
(2) worst case memory consumption for the journal scales linearly
with the number of file systems (32MB per file system; see the rough
total after this list).
(3) we take away the ability of the file system to do intelligent
head movement on the drives (i.e., I end up begging the application team
to please only use one file system per drive at a time for ingest ;-)).
The same goes for allocation - we basically have to push this up to the
application to use the capacity in an even way.
(4) pain of administration of multiple file systems.
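(For the box above, that worst case works out to roughly 20 file systems
x 32MB = 640MB of memory just for journals, which is painful on a
memory- and CPU-limited appliance.)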
I know that other file systems deal with scale better, but the question
is really how to move the mass of linux users onto these large and
increasingly common storage devices in a way that handles these challenges.
ric

* Re: topics for the file system mini-summit
From: Andreas Dilger @ 2006-05-26 16:48 UTC (permalink / raw)
To: Ric Wheeler; +Cc: linux-fsdevel

On May 25, 2006  14:44 -0700, Ric Wheeler wrote:
> [...]
> I know that other file systems deal with scale better, but the question
> is really how to move the mass of linux users onto these large and
> increasingly common storage devices in a way that handles these challenges.

In a way what you describe is Lustre - it aggregates multiple "smaller"
filesystems into a single large filesystem from the application POV
(though in many cases "smaller" filesystems are 2TB).  It runs e2fsck
in parallel if needed, has smart object allocation (clients do delayed
allocation, can load balance across storage targets, etc), can run with
down storage targets.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

* Re: topics for the file system mini-summit
From: Ric Wheeler @ 2006-05-27 0:49 UTC (permalink / raw)
To: Andreas Dilger; +Cc: linux-fsdevel

Andreas Dilger wrote:
> In a way what you describe is Lustre - it aggregates multiple "smaller"
> filesystems into a single large filesystem from the application POV
> (though in many cases "smaller" filesystems are 2TB).  It runs e2fsck
> in parallel if needed, has smart object allocation (clients do delayed
> allocation, can load balance across storage targets, etc), can run with
> down storage targets.

The approach that lustre takes here is great - distributed systems
typically take into account subcomponent failures as a fact of life &
do this better than many single system designs...

The challenge is still there on the "smaller" file systems that make up
Lustre - you can spend a lot of time waiting for just one fsck to finish ;-)

ric

* Re: topics for the file system mini-summit
From: Andreas Dilger @ 2006-05-27 14:18 UTC (permalink / raw)
To: Ric Wheeler; +Cc: linux-fsdevel

On May 26, 2006  20:49 -0400, Ric Wheeler wrote:
> The challenge is still there on the "smaller" file systems that make up
> Lustre - you can spend a lot of time waiting for just one fsck to finish ;-)

CFS is actually quite interested in improving the health and reliability
of the component filesystems also.  That is the reason for our interest
in the U. Wisconsin IRON filesystem work, which we are (slowly) working
to include into ext3.

This will also be our focus for upcoming filesystem work.  It is
relatively easy to make filesystems with 64-bit structures, but the
ability to run such large filesystems in the face of corruption is the
real challenge.  It isn't practical to need a 17-year e2fsck time,
extrapolating 2TB e2fsck times to 2^48 block filesystems.  A lot of the
features in ZFS make sense in this regard.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
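
(For scale, that extrapolation is roughly: 2^48 blocks at 4KB each is
2^60 bytes, i.e. 2^19 = 524,288 times the size of a 2TB filesystem, so
if a clean 2TB e2fsck pass takes on the order of 17 minutes and the time
scales linearly with size, 524,288 x 17 minutes comes out to about 17
years.  Both the 17-minute baseline and the linear scaling are
assumptions; real fsck times degrade with file count as well as size.)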

* Re: topics for the file system mini-summit
From: Ric Wheeler @ 2006-05-28 1:44 UTC (permalink / raw)
To: Andreas Dilger; +Cc: linux-fsdevel

Andreas Dilger wrote:
> CFS is actually quite interested in improving the health and reliability
> of the component filesystems also.  That is the reason for our interest
> in the U. Wisconsin IRON filesystem work, which we are (slowly) working
> to include into ext3.

We actually were the sponsors of the Wisconsin work, so I am glad to
hear that it has a real impact.

I think that the Iron FS ideas will help, but they still don't eliminate
the issues of scalability with fsck (and some of the scalability issues
I see where performance dips with high object count file systems).

> This will also be our focus for upcoming filesystem work.  It is
> relatively easy to make filesystems with 64-bit structures, but the
> ability to run such large filesystems in the face of corruption is the
> real challenge.  It isn't practical to need a 17-year e2fsck time,
> extrapolating 2TB e2fsck times to 2^48 block filesystems.  A lot of the
> features in ZFS make sense in this regard.

Absolutely agree - I wonder if there is some value in trying to go back
to profiling fsck if someone has not already done that.  It won't get
rid of the design limitations, but we might be able to make some
significant improvements...

ric

* Re: topics for the file system mini-summit
From: Matthew Wilcox @ 2006-05-29 0:11 UTC (permalink / raw)
To: Ric Wheeler; +Cc: linux-fsdevel

On Thu, May 25, 2006 at 02:44:50PM -0700, Ric Wheeler wrote:
> The obvious alternative to this is to break up these big disks into
> multiple small file systems, but there again we hit several issues.
> [...]
> I know that other file systems deal with scale better, but the question
> is really how to move the mass of linux users onto these large and
> increasingly common storage devices in a way that handles these challenges.

How do you handle the inode number space?  Do you partition it across
the sub-filesystems, or do you prohibit hardlinks between the sub-fses?

* Re: topics for the file system mini-summit
From: Ric Wheeler @ 2006-05-29 2:07 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: linux-fsdevel

Matthew Wilcox wrote:
> How do you handle the inode number space?  Do you partition it across
> the sub-filesystems, or do you prohibit hardlinks between the sub-fses?

I think that the namespace needs to present a normal file system set of
operations - support for hardlinks, no magic directories, etc. - so that
applications don't need to load balance across (or even be aware of) the
sub-units that provide storage.  If we removed that requirement, we
would be back to today's collection of various file systems mounted on
a single host.

I know that lustre aggregates full file systems, but you could build a
file system on top of a collection of disk partitions/LUNs and then your
inode numbers could be extended to encode the partition number and the
internal mapping.  You could even harden the block groups to the point
that fsck could heal one group while the file system was (mostly?)
online, backed up by the rest of the block groups...

ric
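
A minimal sketch of what the "encode the partition number in the inode
number" idea could look like, purely for illustration - the names and
the 32/32 bit split are assumptions, not a proposal from the thread:

#include <stdint.h>

#define DOMAIN_SHIFT 32

/* Global inode number = (sub-filesystem index << 32) | local inode. */
static inline uint64_t make_global_ino(uint32_t domain, uint32_t local_ino)
{
	return ((uint64_t)domain << DOMAIN_SHIFT) | local_ino;
}

static inline uint32_t global_ino_domain(uint64_t ino)
{
	return (uint32_t)(ino >> DOMAIN_SHIFT);
}

static inline uint32_t global_ino_local(uint64_t ino)
{
	return (uint32_t)(ino & 0xffffffffu);
}

(As pointed out later in the thread, exposing 64-bit inode numbers to
userspace has its own compatibility costs.)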

* Re: topics for the file system mini-summit
From: Andreas Dilger @ 2006-05-29 16:09 UTC (permalink / raw)
To: Ric Wheeler; +Cc: Matthew Wilcox, linux-fsdevel

On May 28, 2006  22:07 -0400, Ric Wheeler wrote:
> I think that the namespace needs to present a normal file system set of
> operations - support for hardlinks, no magic directories, etc. - so that
> applications don't need to load balance across (or even be aware of) the
> sub-units that provide storage.  If we removed that requirement, we
> would be back to today's collection of various file systems mounted on
> a single host.
>
> I know that lustre aggregates full file systems

Yes - we have a metadata-only filesystem which exports the inode numbers
and namespace, and then separate (essentially private) filesystems that
store all of the data.  The object store filesystems do not export any
namespace that is visible to userspace.

> you could build a
> file system on top of a collection of disk partitions/LUNs and then your
> inode numbers could be extended to encode the partition number and the
> internal mapping.  You could even harden the block groups to the point
> that fsck could heal one group while the file system was (mostly?)
> online, backed up by the rest of the block groups...

This is one thing that we have been thinking of for ext3.  Instead of a
filesystem-wide "error" bit we could move this per-group to only mark
the block or inode bitmaps in error if they have a checksum failure.
This would prevent allocations from that group to avoid further potential
corruption of the filesystem metadata.

Once an error is detected then a filesystem service thread or a userspace
helper would walk the inode table (starting in the current group, which
is most likely to hold the relevant data) recreating the respective bitmap
table and keeping a "valid bit" bitmap as well.  Once all of the bits
in the bitmap are marked valid then we can start using this group again.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
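
A rough sketch of the per-group state this would need - the structure
and names here are invented for illustration, not actual ext3 code:

#include <stdint.h>
#include <string.h>

#define BLOCKS_PER_GROUP 32768

struct group_repair_state {
	unsigned int	error:1;	/* bitmap failed its checksum */
	unsigned int	no_alloc:1;	/* allocations disabled here  */
	uint8_t		valid[BLOCKS_PER_GROUP / 8];	/* "valid bit" bitmap */
};

/* Bitmap checksum failure: stop allocating from this group. */
static void group_mark_error(struct group_repair_state *gs)
{
	gs->error = 1;
	gs->no_alloc = 1;
	memset(gs->valid, 0, sizeof(gs->valid));
}

/* The repair helper marks a block valid once it has been accounted for
 * by walking the inode tables; reads keep working throughout. */
static void group_mark_block_valid(struct group_repair_state *gs,
				   unsigned int block_in_group)
{
	gs->valid[block_in_group / 8] |= 1u << (block_in_group % 8);
}

/* Once every bit is accounted for, the rebuilt bitmap replaces the
 * corrupt one and allocation resumes. */
static void group_repair_done(struct group_repair_state *gs)
{
	gs->error = 0;
	gs->no_alloc = 0;
}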

* Re: topics for the file system mini-summit
From: Ric Wheeler @ 2006-05-29 19:29 UTC (permalink / raw)
To: Andreas Dilger; +Cc: Matthew Wilcox, linux-fsdevel

Andreas Dilger wrote:
> Instead of a filesystem-wide "error" bit we could move this per-group to
> only mark the block or inode bitmaps in error if they have a checksum
> failure.  This would prevent allocations from that group to avoid further
> potential corruption of the filesystem metadata.
>
> Once an error is detected then a filesystem service thread or a userspace
> helper would walk the inode table (starting in the current group, which
> is most likely to hold the relevant data) recreating the respective bitmap
> table and keeping a "valid bit" bitmap as well.  Once all of the bits
> in the bitmap are marked valid then we can start using this group again.

That is a neat idea - would you lose complete access to the impacted
group, or have you thought about "best effort" read-only while under repair?

One thing that has worked very well for us is that we keep a digital
signature of each user object (MD5, SHA-X hash, etc) so we can validate
that what we wrote is what got read back.  This also provides a very
powerful sanity check after getting hit by failing media or severe file
system corruption since whatever we do manage to salvage (which might
not be all files) can be validated.

As an archival (write once, read infrequently) storage device, this
works pretty well for us since the signature does not need to be
constantly recomputed on each write/append.

For general purpose read/write workloads, I wonder if it would make
sense to compute and store such a checksum or signature on close (say in
an extended attribute)?  It might be useful to use another of those
special attributes (like the immutable attribute) to indicate that this
file is important enough to digitally sign on close.

Regards,

Ric
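
A userspace sketch of the "checksum on close, stored in an extended
attribute" idea - the attribute name, the choice of SHA-1 and the helper
itself are assumptions made up for illustration (links against OpenSSL's
libcrypto):

#include <unistd.h>
#include <sys/xattr.h>
#include <openssl/sha.h>

static int sign_and_close(int fd)
{
	SHA_CTX ctx;
	unsigned char buf[65536], digest[SHA_DIGEST_LENGTH];
	ssize_t n;

	SHA1_Init(&ctx);
	if (lseek(fd, 0, SEEK_SET) < 0)
		return -1;
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		SHA1_Update(&ctx, buf, (size_t)n);
	if (n < 0)
		return -1;
	SHA1_Final(digest, &ctx);

	/* "user.sha1" is a made-up attribute name for this sketch. */
	if (fsetxattr(fd, "user.sha1", digest, sizeof(digest), 0) < 0)
		return -1;
	return close(fd);
}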

* Re: topics for the file system mini-summit
From: Andreas Dilger @ 2006-05-30 6:14 UTC (permalink / raw)
To: Ric Wheeler; +Cc: Matthew Wilcox, linux-fsdevel

On May 29, 2006  15:29 -0400, Ric Wheeler wrote:
> That is a neat idea - would you lose complete access to the impacted
> group, or have you thought about "best effort" read-only while under repair?

I think we would only need to prevent new allocation from the group if
the bitmap is corrupted.  The extent format already has a magic number
to give a very quick sanity check (unlike indirect blocks which can be
filled with random garbage on large filesystems and still appear valid).
We are looking at adding checksums in the extent metadata and could also
do extra internal consistency checks to validate this metadata (e.g.
sequential ordering of logical offsets, non-overlapping logical offsets,
proper parent->child logical offset hierarchy, etc).

So, we are mostly safe from the "incorrect block free" side, and just
need to worry about the "block is free in bitmap, don't reallocate"
problem.  Allowing unlinks in a group also allows the "valid" bitmap to
be updated when the bits are cleared, so this is beneficial to the end
goal of getting an all-valid block bitmap.  We could even get more fancy
and allow blocks marked valid to be used for allocations, but that is
more complex than I like.

> One thing that has worked very well for us is that we keep a digital
> signature of each user object (MD5, SHA-X hash, etc) so we can validate
> that what we wrote is what got read back.  This also provides a very
> powerful sanity check after getting hit by failing media or severe file
> system corruption since whatever we do manage to salvage (which might
> not be all files) can be validated.

Yes, we've looked at this also for Lustre (we can already do checksums
from the client memory down to the server disk), but the problem of
consistency in the face of write/truncate/append and a crash is complex.
There's also the issue of whether to do partial-file checksums (in order
to allow more efficient updates) or full-file checksums.  I believe at
one point there was work on a checksum loop device, but this also has
potential consistency problems in the face of a crash.

> For general purpose read/write workloads, I wonder if it would make
> sense to compute and store such a checksum or signature on close (say in
> an extended attribute)?  It might be useful to use another of those
> special attributes (like the immutable attribute) to indicate that this
> file is important enough to digitally sign on close.

Hmm, good idea.  If a file is immutable that makes it fairly certain it
won't be modified any time soon, so a good candidate for checksumming.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

* Re: topics for the file system mini-summit
From: Stephen C. Tweedie @ 2006-06-07 10:10 UTC (permalink / raw)
To: Andreas Dilger
Cc: Ric Wheeler, Matthew Wilcox, linux-fsdevel, Stephen Tweedie

Hi,

On Mon, 2006-05-29 at 10:09 -0600, Andreas Dilger wrote:
> This is one thing that we have been thinking of for ext3.  Instead of a
> filesystem-wide "error" bit we could move this per-group to only mark
> the block or inode bitmaps in error if they have a checksum failure.
> This would prevent allocations from that group to avoid further potential
> corruption of the filesystem metadata.

Trouble is, individual files can span multiple groups easily.  And one
of the common failure modes is failure in the indirect tree.  What
action do you take if you detect that?

There is fundamentally a large difference between the class of errors
that can arise due to EIO --- simple loss of a block of data --- and
those which can arise from actual corrupt data/metadata.  If we detect
the latter and attempt to soldier on regardless, then we have no idea
what inconsistencies we are allowing to be propagated through the
filesystem.

That can easily end up corrupting files far from the actual error.  Say
an indirect block is corrupted; we delete that file, and end up freeing
a block belonging to some other file on a distant block group.  Ooops.
Once that other block gets reallocated and overwritten, we have
corrupted that other file.

*That* is why taking the fs down/readonly on failure is the safe option.

The inclusion of checksums would certainly allow us to harden things.
In the above scenario, failure of the checksum test would allow us to
discard corrupt indirect blocks before we could allow any harm to come
to other disk blocks.  But that only works for cases where the checksum
notices the problem; if we're talking about possible OS bugs, memory
corruption etc. then it is quite possible to get corruption in the in-
memory copy, which gets properly checksummed and written to disk, so you
can't rely on that catching all cases.

--Stephen
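
To make the failure mode concrete, a toy sketch of the kind of check
being discussed before trusting an indirect block at delete time - the
structures and limits are invented for illustration, and a pure range
check like this cannot catch a pointer that is in range but simply
wrong, which is exactly the point about needing checksums:

#include <stdint.h>

struct fs_layout {
	uint32_t first_data_block;
	uint32_t blocks_count;
};

/* Return 1 if every pointer in the indirect block is at least in range. */
static int indirect_block_plausible(const struct fs_layout *fs,
				    const uint32_t *ptrs, unsigned int n)
{
	unsigned int i;

	for (i = 0; i < n; i++) {
		uint32_t blk = ptrs[i];

		if (blk == 0)		/* hole */
			continue;
		if (blk < fs->first_data_block || blk >= fs->blocks_count)
			return 0;	/* clearly corrupt: refuse to free */
	}
	return 1;
}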

* Re: topics for the file system mini-summit
From: Andi Kleen @ 2006-06-07 14:03 UTC (permalink / raw)
To: Stephen C. Tweedie; +Cc: Ric Wheeler, Matthew Wilcox, linux-fsdevel

"Stephen C. Tweedie" <sct@redhat.com> writes:

> The inclusion of checksums would certainly allow us to harden things.
> In the above scenario, failure of the checksum test would allow us to
> discard corrupt indirect blocks before we could allow any harm to come
> to other disk blocks.  But that only works for cases where the checksum
> notices the problem; if we're talking about possible OS bugs, memory
> corruption etc. then it is quite possible to get corruption in the in-
> memory copy, which gets properly checksummed and written to disk, so you
> can't rely on that catching all cases.

I don't think you'll ever get a good solution for random kernel memory
corruption - if that happens you are dead no matter what you do.  Even
if your file system still works, your application will eventually
produce garbage when its own data gets corrupted.

Limiting detection to on-storage corruption is entirely reasonable.
And handling 100% of all cases is not feasible anyways.  Just handling
more than currently would already be a big step forward.

"The perfect is the enemy of the good"

-Andi

* Re: topics for the file system mini-summit
From: Andreas Dilger @ 2006-06-07 18:55 UTC (permalink / raw)
To: Stephen C. Tweedie; +Cc: Ric Wheeler, Matthew Wilcox, linux-fsdevel

On Jun 07, 2006  11:10 +0100, Stephen C. Tweedie wrote:
> Trouble is, individual files can span multiple groups easily.  And one
> of the common failure modes is failure in the indirect tree.  What
> action do you take if you detect that?

Return an IO error for that part of the file?  We already refuse to free
file blocks that overlap with filesystem metadata, but have no way to
know whether the rest of the blocks are valid or not.

> There is fundamentally a large difference between the class of errors
> that can arise due to EIO --- simple loss of a block of data --- and
> those which can arise from actual corrupt data/metadata.  If we detect
> the latter and attempt to soldier on regardless, then we have no idea
> what inconsistencies we are allowing to be propagated through the
> filesystem.

Recall that one of the other goals is to add checksumming to the extent
tree metadata (if it isn't already covered by the inode checksum).  Even
today, the fact that the extent format has a magic allows some types of
corruption to be detected.  The structure is also somewhat verifiable
(e.g. logical extent offsets are increasing, logical_offset + length is
non-overlapping with the next logical offset, etc) even without checksums.

The proposed ext3_extent_tail would also contain an inode+generation
back-reference, and the checksum would depend on the physical block
location, so if one extent index block were incorrectly written in the
place of another, or the higher-level reference were corrupted, this
would also be detectable.

struct ext3_extent_tail {
	__u64	et_inum;
	__u32	et_igeneration;
	__u32	et_checksum;
};

> That can easily end up corrupting files far from the actual error.  Say
> an indirect block is corrupted; we delete that file, and end up freeing
> a block belonging to some other file on a distant block group.  Ooops.
> Once that other block gets reallocated and overwritten, we have
> corrupted that other file.

Oh, I totally agree with that, which is another reason why I've proposed
the "block mapped extent" several times.  It would be referenced from an
extent index block or inode, would start with an extent header to verify
that these are at least semi-plausible block pointers, and can optionally
have an ext3_extent_tail to validate the block data itself.  The
block-mapped extent is useful for fragmented files or files with lots of
small holes in them.  Conceivably it would also be possible to quickly
remap old block-mapped (indirect tree) files to bm-extent files if this
was desirable.

> *That* is why taking the fs down/readonly on failure is the safe option.

And wait 17 years for e2fsck to complete?  While I agree it is the
safest option, sometimes it is necessary to just block off parts of the
filesystem from writes and soldier on until the system can be taken
down safely.

> The inclusion of checksums would certainly allow us to harden things.
> In the above scenario, failure of the checksum test would allow us to
> discard corrupt indirect blocks before we could allow any harm to come
> to other disk blocks.  But that only works for cases where the checksum
> notices the problem; if we're talking about possible OS bugs, memory
> corruption etc. then it is quite possible to get corruption in the in-
> memory copy, which gets properly checksummed and written to disk, so you
> can't rely on that catching all cases.

I agree, we can't ever handle everything unless we get checksums from
the top of linux to the bottom (maybe stored in the page table?), but we
can at least do the best we can.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
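
A sketch of how such a tail might be verified - the checksum helper, the
exact fields covered, and the placement of the tail at the end of the
block are all assumptions for illustration, not the actual ext3 design:

#include <stdint.h>

typedef uint64_t __u64;
typedef uint32_t __u32;

struct ext3_extent_tail {
	__u64	et_inum;		/* owning inode (back-reference) */
	__u32	et_igeneration;		/* inode generation              */
	__u32	et_checksum;		/* covers contents + location    */
};

/* Stand-in for a real CRC; only here so the sketch is self-contained. */
static __u32 mix(__u32 seed, const void *data, unsigned int len)
{
	const unsigned char *p = data;
	while (len--)
		seed = seed * 33 + *p++;
	return seed;
}

static int extent_block_tail_ok(const void *block, unsigned int blocksize,
				__u64 physical_blk, __u64 inum, __u32 igen)
{
	const struct ext3_extent_tail *tail =
		(const void *)((const char *)block + blocksize - sizeof(*tail));
	__u32 csum;

	/* Wrong owner: this block was written in the wrong place or the
	 * reference to it is stale. */
	if (tail->et_inum != inum || tail->et_igeneration != igen)
		return 0;

	/* Folding in the physical location catches a block that is
	 * internally consistent but landed at the wrong address. */
	csum = mix(0, &physical_blk, sizeof(physical_blk));
	csum = mix(csum, block, blocksize - sizeof(__u32));
	return csum == tail->et_checksum;
}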

* Re: topics for the file system mini-summit
From: Valerie Henson @ 2006-06-01 2:19 UTC (permalink / raw)
To: Ric Wheeler; +Cc: linux-fsdevel, Arjan van de Ven

On Thu, May 25, 2006 at 02:44:50PM -0700, Ric Wheeler wrote:
> (1) repair/fsck time can take hours or even days depending on the
> health of the file system and its underlying disk as well as the number
> of files. [...]
> (3) very poor fault containment for these very large devices - it
> would be great to be able to ride through a failure of a segment of the
> underlying storage without taking down the whole file system.
>
> The obvious alternative to this is to break up these big disks into
> multiple small file systems, but there again we hit several issues.

1 and 3 are some of my main concerns, and what I want to focus a lot
of the workshop discussion on.  I view the question as: How do we keep
file system management simple while splitting the underlying storage
into isolated failure domains that can be repaired individually
online?  (Say that three times fast.)  Just splitting up into multiple
file systems only solves the second problem, and only if you have
forced umount, as you noted.

The approach we took in ZFS was to separate namespace management and
allocation management.  File systems aren't a fixed size, they take up
as much space as they need from a shared underlying pool.  You can
think of a file system in ZFS as a movable directory with management
bits attached.  I don't think this is the direction we should go, but
it's an example of separating your namespace management from a lot of
other stuff it doesn't really need to be attached to.

I don't think a block group is a good enough fault isolation domain -
think hard links.  What I think we need is normal file system
structures when you are referencing stuff inside your fault isolation
domain, and something more complicated if you have to reference stuff
outside.  One of Arjan's ideas involves something we're calling
continuation inodes - if the file's data is stored in multiple
domains, it has a separate continuation inode in each domain, and each
continuation inode has all the information necessary to run a full
fsck on the data inside that domain.  Similarly, if a directory has a
hard link to a file outside its domain, we'll have to allocate a
continuation inode and dir entry block in the domain containing the
file.  The idea is that you can run fsck on a domain without having to
go look outside that domain.  You may have to clean up a few things in
other domains, but they are easy to find and don't require an fsck in
other domains.

-VAL
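
Since the continuation-inode idea is explicitly still hand-waving, the
following is nothing more than one possible shape for it - every field
here is an assumption:

#include <stdint.h>

struct cont_inode {
	uint32_t ci_domain;		/* domain this continuation lives in */
	uint32_t ci_local_ino;		/* inode number local to that domain */
	uint32_t ci_parent_domain;	/* domain of the inode it continues  */
	uint32_t ci_parent_ino;		/* ...and its inode number there     */
	uint32_t ci_type;		/* data continuation vs. dir entries */
	uint32_t ci_nblocks;		/* how much local data it maps       */
	/*
	 * Block/extent map follows, and every pointer in it is local to
	 * ci_domain, so a per-domain fsck never has to chase references
	 * outside the domain it is checking.
	 */
};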

* Re: topics for the file system mini-summit
From: Matthew Wilcox @ 2006-06-01 2:42 UTC (permalink / raw)
To: Valerie Henson; +Cc: Ric Wheeler, linux-fsdevel, Arjan van de Ven

On Wed, May 31, 2006 at 07:19:09PM -0700, Valerie Henson wrote:
> I don't think a block group is a good enough fault isolation domain -
> think hard links.  What I think we need is normal file system
> structures when you are referencing stuff inside your fault isolation
> domain, and something more complicated if you have to reference stuff
> outside.  One of Arjan's ideas involves something we're calling
> continuation inodes [...]  The idea is that you can run fsck on a
> domain without having to go look outside that domain.

I don't quite get it.  Let's say we have directories A and B in domain A
and domain B.  A file C is created in directory B and is thus allocated
in domain B.  Now we create a link to file C in directory A, and remove
the link from directory B.  Presumably we have a continuation inode in
domain A and an inode with no references to it in domain B.  How does
fsck tell that there's a continuation inode in a different domain?

The obvious answer is to put a special flag on the continued inode ...
but then the question becomes more about the care and feeding of said
flag.  Maybe it's a counter.  But then fsck needs to check the counter's
not corrupt.  Or is it backlinks from the inode to all its continuation
inodes?  That quickly gets messy with locking.

Surely XFS must have a more elegant solution than this?

* Re: topics for the file system mini-summit
From: Valerie Henson @ 2006-06-01 3:24 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: Ric Wheeler, linux-fsdevel, Arjan van de Ven

On Wed, May 31, 2006 at 08:42:47PM -0600, Matthew Wilcox wrote:
> I don't quite get it.  Let's say we have directories A and B in domain A
> and domain B.  A file C is created in directory B and is thus allocated
> in domain B.  Now we create a link to file C in directory A, and remove
> the link from directory B.

All correct up to here...

> Presumably we have a continuation inode in
> domain A and an inode with no references to it in domain B.  How does
> fsck tell that there's a continuation inode in a different domain?

Actually, the continuation inode is in B.  When we create a link in
directory A to file C, a continuation inode for directory A is created
in domain B, and a block containing the link to file C is allocated
inside domain B as well.  So there is no continuation inode in domain
A.

That being said, this idea is at the hand-waving stage and probably
has many other (hopefully non-fatal) flaws.  Thanks for taking a look!

> Surely XFS must have a more elegant solution than this?

val@goober:/usr/src/linux-2.6.16.19$ wc -l `find fs/xfs/ -type f`
[snip]
 109083 total

:)

-VAL

* Re: topics for the file system mini-summit
From: Matthew Wilcox @ 2006-06-01 12:45 UTC (permalink / raw)
To: Valerie Henson; +Cc: Ric Wheeler, linux-fsdevel, Arjan van de Ven

On Wed, May 31, 2006 at 08:24:18PM -0700, Valerie Henson wrote:
> Actually, the continuation inode is in B.  When we create a link in
> directory A to file C, a continuation inode for directory A is created
> in domain B, and a block containing the link to file C is allocated
> inside domain B as well.  So there is no continuation inode in domain
> A.
>
> That being said, this idea is at the hand-waving stage and probably
> has many other (hopefully non-fatal) flaws.  Thanks for taking a look!

OK, so we really have two kinds of continuation inodes, and it might be
sensible to name them differently.  We have "here's some extra data for
that inode over there" and "here's a hardlink from another domain".  I
dub the first one a 'continuation inode' and the second a 'shadow inode'.

Continuation inodes and shadow inodes both suffer from the problem
that they might be unwittingly orphaned, unless they have some kind of
back-link to their referrer.  That seems more soluble though.  The domain
B minifsck can check to see if the backlinked inode or directory is
still there.  If the domain A minifsck prunes something which has a link
to domain B, it should be able to just remove the continuation/shadow
inode there, without fscking domain B.

Another advantage to this is that inodes never refer to blocks outside
their zone, so we can forget about all this '64-bit block number' crap.
We don't even need 64-bit inode numbers -- we can use special direntries
for shadow inodes, and inodes which refer to continuation inodes need
a new encoding scheme anyway.  Normal inodes would remain 32-bit and
refer to the local domain, and shadow/continuation inode numbers would
be 32 bits of domain, plus 32 bits of inode within that domain.

So I like this ;-)

> > Surely XFS must have a more elegant solution than this?
>
> val@goober:/usr/src/linux-2.6.16.19$ wc -l `find fs/xfs/ -type f`
> [snip]
>  109083 total

Well, yes.  I think that inside the Linux XFS implementation there's a
small and neat filesystem struggling to get out.  Once SGI finally dies,
perhaps we can rip out all the CXFS stubs and IRIX compatibility.  Then
we might be able to see it.

For fun, if you're a masochist, try to follow the code flow for
something easy like fsync().

const struct file_operations xfs_file_operations = {
	.fsync = xfs_file_fsync,
}

xfs_file_fsync(struct file *filp, struct dentry *dentry, int datasync)
{
	struct inode *inode = dentry->d_inode;
	vnode_t *vp = vn_from_inode(inode);
	int error;
	int flags = FSYNC_WAIT;

	if (datasync)
		flags |= FSYNC_DATA;
	VOP_FSYNC(vp, flags, NULL, (xfs_off_t)0, (xfs_off_t)-1, error);
	return -error;
}

#define _VOP_(op, vp) (*((vnodeops_t *)(vp)->v_fops)->op)

#define VOP_FSYNC(vp,f,cr,b,e,rv) \
	rv = _VOP_(vop_fsync, vp)((vp)->v_fbhv,f,cr,b,e)

vnodeops_t xfs_vnodeops = {
	.vop_fsync = xfs_fsync,
}

Finally, xfs_fsync actually does the work.  The best bit about all this
abstraction is that there's only one xfs_vnodeops defined!  So this could
all be done with an xfs_file_fsync() that munged its parameters and called
xfs_fsync() directly.  That wouldn't even affect IRIX compatibility,
but it would make life difficult for CXFS, apparently.

http://oss.sgi.com/projects/xfs/mail_archive/200308/msg00214.html

* Re: topics for the file system mini-summit
From: Arjan van de Ven @ 2006-06-01 12:53 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: Valerie Henson, Ric Wheeler, linux-fsdevel

Matthew Wilcox wrote:
> OK, so we really have two kinds of continuation inodes, and it might be
> sensible to name them differently.  We have "here's some extra data for
> that inode over there" and "here's a hardlink from another domain".  I
> dub the first one a 'continuation inode' and the second a 'shadow inode'.

nonono, the "hardlink" is in a directory inode, and that *directory* has
a continuation for the dentry that constitutes the hardlink.  But that
dentry is "local" to the data.  So the directory ends up being split
over the domains.

> Another advantage to this is that inodes never refer to blocks outside
> their zone, so we can forget about all this '64-bit block number' crap.

exactly the point!

* Re: topics for the file system mini-summit
From: Russell Cattelan @ 2006-06-01 20:06 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Valerie Henson, Ric Wheeler, linux-fsdevel, Arjan van de Ven

Matthew Wilcox wrote:
> Continuation inodes and shadow inodes both suffer from the problem
> that they might be unwittingly orphaned, unless they have some kind of
> back-link to their referrer.  That seems more soluble though.
> [...]
> > > Surely XFS must have a more elegant solution than this?

XFS may be a bit better suited to do this "encapsulated" form of
inode/directory management since its AGs already try to keep metadata
close to the file data.  So it would be quite feasible to offline
particular AGs and do a consistency check on them.

But yes, hard links pose the same problem as being discussed here.
File data also can span AGs and thus create interdependency of AGs in
terms of both the file data and the metadata blocks that manage the
extents.

But the idea of creating continuation inodes seems like a good one.
For XFS it might be better to do this at the AG level, so as soon as a
hard link in one AG refers to an inode in another AG, the AGs are
flagged as being linked.  This would allow any form of interdependent
data to be grouped (quotas, extended attributes, etc).

> For fun, if you're a masochist, try to follow the code flow for
> something easy like fsync().
> [...]
> #define _VOP_(op, vp) (*((vnodeops_t *)(vp)->v_fops)->op)

Don't forget the extremely hard to untangle behaviors.

#define VNHEAD(vp) ((vp)->v_bh.bh_first)
#define VOP(op, vp) (*((bhv_vnodeops_t *)VNHEAD(vp)->bd_ops)->op)

Which I won't even try to explain, because they confuse the crap out of
me.  But that is what CXFS uses to create different call chains.

Oh, and to make things even more evil, the call chains are dynamically
changed based on whether an inode has a client or not.  So in the case
of no cxfs client the call chain is about the same as local xfs, but
when a client comes in, cxfs will insert more behaviors / vop's that
hook up all the cluster management stuff for that inode.

> Finally, xfs_fsync actually does the work.  The best bit about all this
> abstraction is that there's only one xfs_vnodeops defined!  So this could
> all be done with an xfs_file_fsync() that munged its parameters and called
> xfs_fsync() directly.  That wouldn't even affect IRIX compatibility,
> but it would make life difficult for CXFS, apparently.

So some of my ex co-workers at SGI will disagree with the following,
but... the VOP's that are left in XFS are completely pointless at this
point; since xfs never has anything other than one call chain, it
shouldn't have to deal with all that stuff in local mode.  All the
behavior call chaining should be handled by CXFS, and thus all the VOP
code should be pushed to that code base.  I think there are 4 VOP calls
that are used internally by XFS, and as such the callers of those vop's
may need something else that provides a way of re-entering the call
chain at the top.

I have done some of the work in terms of just replacing the VOP calls
with straight calls to the final functions, in the hopes of tossing the
vnodeops out of XFS.  And I spec'd out a way of fixing CXFS to deal with
the vops internally, but unfortunately that kind of work will always
fall under the ENORESOURCES category.  I know SGI will never take it as
long as CXFS lives, but maybe someday when SGI finally fizzles... :-)

Oh, and the whole IRIX compat is crap at this point since many of the
vop call params have been changed to match linux params.

* Re: topics for the file system mini-summit
From: Nathan Scott @ 2006-06-02 11:27 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Valerie Henson, Ric Wheeler, linux-fsdevel, Arjan van de Ven

Hi there,

On Thu, Jun 01, 2006 at 06:45:17AM -0600, Matthew Wilcox wrote:
> We don't even need 64-bit inode numbers -- we can use special direntries
> for shadow inodes, and inodes which refer to continuation inodes need
> a new encoding scheme anyway.  Normal inodes would remain 32-bit and
> refer to the local domain, and shadow/continuation inode numbers would
> be 32 bits of domain, plus 32 bits of inode within that domain.

Be careful of 64 bit inode numbers, we have had a world of pain with
those due to broken userspace applications that cannot work with larger
inode numbers (backup programs, with on-disk formats that only have room
for 32 bits).  To the point that we're actually having to find new ways
to keep our inode number space within the 32 bits-of-significance range.

> > > Surely XFS must have a more elegant solution than this?

Do you mean to the problem of inodes spanning multiple devices?  We
don't really have this concept in XFS today, but it remains an area of
active interest (and current development).  Or are you referring to the
problem of scaling in general?  Yes, XFS uses many approaches there, it
was designed with this in mind.  The first step toward an "elegant
solution" is to use 64 bit disk addressing, or to at least be able to.

> Well, yes.  I think that inside the Linux XFS implementation
> there's a small and neat filesystem struggling to get out.

Hmm, interesting theory.  In reality, no - all the complexity in XFS is
in core things that, no matter how the code is sliced and diced, still
need to be done... the allocator, the log code, and the multiple Btrees
could all possibly be simpler, but they are all critical pieces of XFS's
ability to scale upward, and all have seen many years of tuning under a
large variety of workloads.

> For fun, if you're a masochist, try to follow the code flow for
> something easy like fsync().
> ...
> Finally, xfs_fsync actually does the work.  The best bit about all this
> abstraction is that there's only one xfs_vnodeops defined!

No, the best bit is that an abstraction like this is a neat fit to the
problems presented by this multiple-devices-within-one-filesystem
approach.  As I said, this is an area of active interest to us also.

cheers.

--
Nathan
* Re: topics for the file system mini-summit
  2006-06-01  2:19 ` Valerie Henson
  2006-06-01  2:42   ` Matthew Wilcox
@ 2006-06-01  5:36   ` Andreas Dilger
  2006-06-03 13:50   ` Ric Wheeler
  2 siblings, 0 replies; 24+ messages in thread
From: Andreas Dilger @ 2006-06-01 5:36 UTC (permalink / raw)
  To: Valerie Henson; +Cc: Ric Wheeler, linux-fsdevel, Arjan van de Ven

On May 31, 2006 19:19 -0700, Valerie Henson wrote:
> I don't think a block group is a good enough fault isolation domain -
> think hard links.  What I think we need is normal file system
> structures when you are referencing stuff inside your fault isolation
> domain, and something more complicated if you have to reference stuff
> outside.  One of Arjan's ideas involves something we're calling
> continuation inodes - if the file's data is stored in multiple
> domains, it has a separate continuation inode in each domain, and each
> continuation inode has all the information necessary to run a full
> fsck on the data inside that domain.  Similarly, if a directory has a
> hard link to a file outside its domain, we'll have to allocate a
> continuation inode and dir entry block in the domain containing the
> file.  The idea is that you can run fsck on a domain without having to
> go look outside that domain.  You may have to clean up a few things in
> other domains, but they are easy to find and don't require an fsck in
> other domains.

This sounds very much like the approach Lustre has taken for clustered
metadata servers (CMD), which was developed as an advanced prototype
last year and is being reimplemented for production now.

In "regular" (non-CMD) Lustre there is a single metadata target (MDT)
which holds all of the namespace (directories, filenames, inodes), and
the inodes have EA metadata that tells users of those files which other
storage targets (OSTs) hold the file data (RAID 0 stripe currently).
OSTs are completely self-contained ext3 filesystems, as is the MDT.

In the prototype CMD Lustre there are multiple metadata targets that
make up a single namespace.  Generally, each directory and the inodes
therein are kept on a single MDT, but in the case of large directories
(> 64k entries, which are split across MDTs by the hash of the
filename), hard links, or renames, it is possible to have a cross-MDT
inode reference in a directory.

The cross-MDT reference is implemented by storing a special dirent in
the directory which tells the caller which other MDT actually has the
inode.  The remote inode itself is held in a private "MDT object"
directory so that it has a local filesystem reference and can be looked
up by a special filename derived from the inode number and, I believe,
the source MDT (either in the filename or the private directory), to
keep the link count correct.

This allows each MDT filesystem to be internally consistent, and the
cross-MDT dirents are treated by e2fsck much the same as symlinks, in
the sense that a dangling reference is non-fatal.  There is (or at
least was, in the CMD prototype design) a second-stage tool which would
get a list of cross-MDT references that it could correlate with the MDT
object directory inodes on the other MDTs and fix up refcounts or
orphaned inodes.

In the case of "split directories", which are implemented in order to
load-balance metadata operations across multiple MDTs, there was also a
need to migrate directory entries to other MDTs when the directory
splits.  That split is only done once, when the directory grows beyond
64k entries, in order to limit the number of cross-MDT entries in the
directory and to get the added parallelism as soon as possible.  After
the initial split, new direntries and their inodes are created together
within a single MDT, though there are several directory "stripes" on
multiple MDTs running in parallel.

The same methods used to do dirent migration were also used for
handling renames across directories on multiple MDTs.  The basics are
that there need to be separate target filesystem primitives exported
for creating and deleting a new inode, and for adding or removing an
entry from a directory (operations which are bundled together in the
Linux VFS).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

^ permalink raw reply	[flat|nested] 24+ messages in thread
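As a rough illustration of the cross-MDT reference scheme Andreas
describes - not the actual Lustre on-disk format, just a sketch of the
concept - a special dirent could record which MDT owns the inode, and
the target MDT for a new entry in a split directory could be chosen by
hashing the filename.  All names and the hash function below are made
up for the example.

        #include <stdint.h>
        #include <stddef.h>

        /* Hypothetical layout of a cross-MDT directory entry: instead
         * of a local inode number, it records which MDT actually holds
         * the inode and its number there. */
        struct cmd_remote_dirent {
                uint32_t mdt_index;   /* MDT that owns the inode      */
                uint64_t remote_ino;  /* inode number on that MDT     */
                uint16_t name_len;
                char     name[];      /* filename, not NUL-terminated */
        };

        /* Pick an MDT for a new entry in a split directory by hashing
         * the name; a simple djb2-style hash stands in for whatever
         * the real implementation uses.  Assumes nr_mdts > 0. */
        static uint32_t pick_mdt(const char *name, size_t len,
                                 uint32_t nr_mdts)
        {
                uint32_t h = 5381;

                while (len--)
                        h = h * 33 + (unsigned char)*name++;
                return h % nr_mdts;
        }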
* Re: topics for the file system mini-summit
  2006-06-01  2:19 ` Valerie Henson
  2006-06-01  2:42   ` Matthew Wilcox
  2006-06-01  5:36   ` Andreas Dilger
@ 2006-06-03 13:50   ` Ric Wheeler
  2006-06-03 14:13     ` Arjan van de Ven
  2 siblings, 1 reply; 24+ messages in thread
From: Ric Wheeler @ 2006-06-03 13:50 UTC (permalink / raw)
  To: Valerie Henson; +Cc: linux-fsdevel, Arjan van de Ven

Valerie Henson wrote:
> On Thu, May 25, 2006 at 02:44:50PM -0700, Ric Wheeler wrote:
>>    (1) repair/fsck time can take hours or even days depending on the
>> health of the file system and its underlying disk as well as the number
>> of files.  This does not work well for large servers and is a disaster
>> for "appliances" that need to run these commands buried deep in some
>> data center without a person watching...
>>    (2) most file system performance testing is done on "pristine" file
>> systems with very few files.  Performance over time, especially with
>> very high file counts, suffers very noticeable performance degradation
>> with very large file systems.
>>    (3) very poor fault containment for these very large devices - it
>> would be great to be able to ride through a failure of a segment of the
>> underlying storage without taking down the whole file system.
>>
>> The obvious alternative to this is to break up these big disks into
>> multiple small file systems, but there again we hit several issues.
>
> 1 and 3 are some of my main concerns, and what I want to focus a lot
> of the workshop discussion on.  I view the question as: How do we keep
> file system management simple while splitting the underlying storage
> into isolated failure domains that can be repaired individually
> online?  (Say that three times fast.)  Just splitting up into multiple
> file systems only solves the second problem, and only if you have
> forced umount, as you noted.

Any thoughts about what the right semantics are for properly doing a
forced unmount, and whether it is doable near term (as opposed to the
more strategic/long-term issues laid out in this thread)?

ric

^ permalink raw reply	[flat|nested] 24+ messages in thread
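To make the isolated-failure-domain idea in the quoted text a little
more concrete, here is a hypothetical C sketch of the kind of
per-domain bookkeeping a continuation inode (as described earlier in
the thread) might carry so that a single domain can be fscked on its
own.  None of these structures exist in any current filesystem; the
field names are invented for illustration.

        #include <stdint.h>

        /* Hypothetical continuation inode: one exists in each
         * fault-isolation domain that holds part of a file's data,
         * and it carries enough information to fsck that domain
         * without consulting any other domain. */
        struct cont_inode {
                uint64_t master_ino;   /* shadow inode number of the file  */
                uint32_t domain_id;    /* domain this continuation lives in */
                uint64_t byte_offset;  /* where this domain's extents begin */
                uint64_t byte_len;     /* how much of the file lives here   */
                uint32_t nlink_local;  /* dirent references in this domain  */
                /* the block/extent map for the local data would follow */
        };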
* Re: topics for the file system mini-summit
  2006-06-03 13:50 ` Ric Wheeler
@ 2006-06-03 14:13   ` Arjan van de Ven
  2006-06-03 15:07     ` Ric Wheeler
  0 siblings, 1 reply; 24+ messages in thread
From: Arjan van de Ven @ 2006-06-03 14:13 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: Valerie Henson, linux-fsdevel

Ric Wheeler wrote:
> Any thoughts about what the right semantics are for properly doing a
> forced unmount, and whether it is doable near term (as opposed to the
> more strategic/long-term issues laid out in this thread)?

I would like to ask you to take one step back; in the past, when I have
seen people want "forced unmount", they actually wanted something else
that they thought (incorrectly, at that point) forced unmount would
solve.

There are a few things an unmount does:
1) detach from the namespace (tree)
2) shut down the filesystem to
   2a) allow someone else to mount/fsck/etc it
   2b) finish stuff up and put it in a known state (clean)
3) shut down IO to a fs for another node to take over
   (which is what the "incorrectly" is about, technically)

^ permalink raw reply	[flat|nested] 24+ messages in thread
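For reference, the umount2() system call that already existed at the
time separates some of these cases in userspace.  A minimal sketch
(error handling trimmed, "/mnt/badfs" is a made-up mount point):
MNT_DETACH covers case 1, detaching the mount from the namespace
immediately and cleaning up when the last user goes away, while
MNT_FORCE aborts outstanding requests (useful mainly for NFS) but is
not the "blow the filesystem away so it can be fscked and remounted"
behaviour being asked about in this thread.

        #include <stdio.h>
        #include <sys/mount.h>

        int main(void)
        {
                /* Case 1: lazy unmount, detach from the tree now. */
                if (umount2("/mnt/badfs", MNT_DETACH) != 0)
                        perror("umount2(MNT_DETACH)");

                /* Abort pending requests on a wedged (e.g. NFS) mount. */
                if (umount2("/mnt/badfs", MNT_FORCE) != 0)
                        perror("umount2(MNT_FORCE)");

                return 0;
        }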
* Re: topics for the file system mini-summit
  2006-06-03 14:13 ` Arjan van de Ven
@ 2006-06-03 15:07   ` Ric Wheeler
  0 siblings, 0 replies; 24+ messages in thread
From: Ric Wheeler @ 2006-06-03 15:07 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Valerie Henson, linux-fsdevel, trond.myklebust

Arjan van de Ven wrote:
> Ric Wheeler wrote:
>> Any thoughts about what the right semantics are for properly doing a
>> forced unmount, and whether it is doable near term (as opposed to the
>> more strategic/long-term issues laid out in this thread)?
>
> I would like to ask you to take one step back; in the past, when I have
> seen people want "forced unmount", they actually wanted something else
> that they thought (incorrectly, at that point) forced unmount would
> solve.
>
> There are a few things an unmount does:
> 1) detach from the namespace (tree)
> 2) shut down the filesystem to
>    2a) allow someone else to mount/fsck/etc it
>    2b) finish stuff up and put it in a known state (clean)
> 3) shut down IO to a fs for another node to take over
>    (which is what the "incorrectly" is about, technically)

We have 20-30 100GB file systems on a single box (to avoid the long
fsck time).  When you hit an issue with one file system (say a panic),
or with a set of file systems (a dead drive might take out 5 of them),
we want to be able to keep the box up, since it is still serving up
something like 2.5TB of storage to the user ;-)

So what I want is all of the following:

   (1) do your 2a - be able to fsck and repair corruptions and then
hopefully remount that file system without a reboot of the box.
   (2) leave all other file systems (including those of the same type)
running without error (good fault isolation).
   (3) I don't want to try to clean up that file system's state - error
out any existing IOs; it is perfectly fine for processes using it to
get blown away.  In effect, treat it (from the file system's point of
view) just like you would a power outage & reboot.  You replay the
journal & recover only after you get the disk back.
   (4) make sure that a hung disk or panic'ed file system does not
prevent an intentional reboot.

In a conversation about this with Trond, I think that he has some other
specific motivations from an NFS point of view as well.

ric

^ permalink raw reply	[flat|nested] 24+ messages in thread
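A minimal sketch of what point (3) might look like inside a
filesystem: an "aborted" flag that the IO paths check, so that a
damaged instance fails requests with EIO instead of taking the box
down, and can later be unmounted, fscked, and remounted.  This is
loosely modelled on the errors=abort style of handling in ext3, but
every name here is made up for illustration and none of it is an
existing API.

        #include <errno.h>
        #include <stdbool.h>

        /* Hypothetical per-superblock state for the fault-isolation
         * behaviour described above. */
        struct myfs_sb_info {
                bool aborted;
        };

        /* Checked at the top of read/write/metadata paths: a dead
         * instance fails new requests instead of wedging the box. */
        static int myfs_check_alive(const struct myfs_sb_info *sbi)
        {
                return sbi->aborted ? -EIO : 0;
        }

        /* Called when fatal corruption is detected or the disk goes
         * away: mark this instance dead, leave its siblings alone.
         * A real implementation would also fail in-flight IO and wake
         * up waiters, as described in points (3) and (4). */
        static void myfs_abort(struct myfs_sb_info *sbi)
        {
                sbi->aborted = true;
        }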
end of thread, other threads:[~2006-06-07 18:55 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz  follow: Atom feed)
-- links below jump to the message on this page --
2006-05-25 21:44 topics for the file system mini-summit Ric Wheeler
2006-05-26 16:48 ` Andreas Dilger
2006-05-27  0:49   ` Ric Wheeler
2006-05-27 14:18     ` Andreas Dilger
2006-05-28  1:44       ` Ric Wheeler
2006-05-29  0:11 ` Matthew Wilcox
2006-05-29  2:07   ` Ric Wheeler
2006-05-29 16:09     ` Andreas Dilger
2006-05-29 19:29       ` Ric Wheeler
2006-05-30  6:14         ` Andreas Dilger
2006-06-07 10:10           ` Stephen C. Tweedie
2006-06-07 14:03             ` Andi Kleen
2006-06-07 18:55               ` Andreas Dilger
2006-06-01  2:19 ` Valerie Henson
2006-06-01  2:42   ` Matthew Wilcox
2006-06-01  3:24     ` Valerie Henson
2006-06-01 12:45       ` Matthew Wilcox
2006-06-01 12:53         ` Arjan van de Ven
2006-06-01 20:06         ` Russell Cattelan
2006-06-02 11:27         ` Nathan Scott
2006-06-01  5:36   ` Andreas Dilger
2006-06-03 13:50   ` Ric Wheeler
2006-06-03 14:13     ` Arjan van de Ven
2006-06-03 15:07       ` Ric Wheeler