* [RFD] Incremental fsck
@ 2008-01-08 21:22 Al Boldi
2008-01-08 21:31 ` Alan
2008-01-08 21:41 ` Rik van Riel
0 siblings, 2 replies; 39+ messages in thread
From: Al Boldi @ 2008-01-08 21:22 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-kernel
Andi Kleen wrote:
> Theodore Tso <tytso@mit.edu> writes:
> > Now, there are good reasons for doing periodic checks every N mounts
> > and after M months. And it has to do with PC class hardware. (Ted's
> > aphorism: "PC class hardware is cr*p").
>
> If these reasons are good ones (some skepticism here) then the correct
> way to really handle this would be to do regular background scrubbing
> during runtime; ideally with metadata checksums so that you can actually
> detect all corruption.
>
> But since fsck is so slow and disks are so big this whole thing
> is a ticking time bomb now. e.g. it is not uncommon to require tens
> of minutes or even hours of fsck time and some server that reboots
> only every few months will eat that when it happens to reboot.
> This means you get a quite long downtime.
Has there been some thought about an incremental fsck?
You know, somehow fencing a sub-dir to do an online fsck?
Thanks for some thoughts!
--
Al
* Re: [RFD] Incremental fsck
2008-01-08 21:22 [RFD] Incremental fsck Al Boldi
@ 2008-01-08 21:31 ` Alan
2008-01-09 9:16 ` Andreas Dilger
2008-01-08 21:41 ` Rik van Riel
1 sibling, 1 reply; 39+ messages in thread
From: Alan @ 2008-01-08 21:31 UTC (permalink / raw)
To: Al Boldi; +Cc: linux-fsdevel, linux-kernel
> Andi Kleen wrote:
>> Theodore Tso <tytso@mit.edu> writes:
>> > Now, there are good reasons for doing periodic checks every N mounts
>> > and after M months. And it has to do with PC class hardware. (Ted's
>> > aphorism: "PC class hardware is cr*p").
>>
>> If these reasons are good ones (some skepticism here) then the correct
>> way to really handle this would be to do regular background scrubbing
>> during runtime; ideally with metadata checksums so that you can actually
>> detect all corruption.
>>
>> But since fsck is so slow and disks are so big this whole thing
>> is a ticking time bomb now. e.g. it is not uncommon to require tens
>> of minutes or even hours of fsck time and some server that reboots
>> only every few months will eat that when it happens to reboot.
>> This means you get a quite long downtime.
>
> Has there been some thought about an incremental fsck?
Is that anything like a cluster fsck? ]:>
* Re: [RFD] Incremental fsck
2008-01-08 21:22 [RFD] Incremental fsck Al Boldi
2008-01-08 21:31 ` Alan
@ 2008-01-08 21:41 ` Rik van Riel
2008-01-09 4:40 ` Al Boldi
1 sibling, 1 reply; 39+ messages in thread
From: Rik van Riel @ 2008-01-08 21:41 UTC (permalink / raw)
To: Al Boldi; +Cc: linux-fsdevel, linux-kernel
On Wed, 9 Jan 2008 00:22:55 +0300
Al Boldi <a1426z@gawab.com> wrote:
> Has there been some thought about an incremental fsck?
>
> You know, somehow fencing a sub-dir to do an online fsck?
Search for "chunkfs"
--
All rights reversed.
* Re: [RFD] Incremental fsck
2008-01-08 21:41 ` Rik van Riel
@ 2008-01-09 4:40 ` Al Boldi
2008-01-09 7:45 ` Valerie Henson
2008-01-09 8:04 ` Valdis.Kletnieks
0 siblings, 2 replies; 39+ messages in thread
From: Al Boldi @ 2008-01-09 4:40 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-fsdevel, linux-kernel
Rik van Riel wrote:
> Al Boldi <a1426z@gawab.com> wrote:
> > Has there been some thought about an incremental fsck?
> >
> > You know, somehow fencing a sub-dir to do an online fsck?
>
> Search for "chunkfs"
Sure, and there is TileFS too.
But why wouldn't it be possible to do this on the current fs infrastructure,
using just a smart fsck, working incrementally on some sub-dir?
Thanks!
--
Al
* Re: [RFD] Incremental fsck
2008-01-09 4:40 ` Al Boldi
@ 2008-01-09 7:45 ` Valerie Henson
2008-01-09 11:52 ` Al Boldi
2008-01-09 8:04 ` Valdis.Kletnieks
1 sibling, 1 reply; 39+ messages in thread
From: Valerie Henson @ 2008-01-09 7:45 UTC (permalink / raw)
To: Al Boldi; +Cc: Rik van Riel, linux-fsdevel, linux-kernel
On Jan 8, 2008 8:40 PM, Al Boldi <a1426z@gawab.com> wrote:
> Rik van Riel wrote:
> > Al Boldi <a1426z@gawab.com> wrote:
> > > Has there been some thought about an incremental fsck?
> > >
> > > You know, somehow fencing a sub-dir to do an online fsck?
> >
> > Search for "chunkfs"
>
> Sure, and there is TileFS too.
>
> But why wouldn't it be possible to do this on the current fs infrastructure,
> using just a smart fsck, working incrementally on some sub-dir?
Several data structures are file system wide and require finding every
allocated file and block to check that they are correct. In
particular, block and inode bitmaps can't be checked per subdirectory.
http://infohost.nmt.edu/~val/review/chunkfs.pdf
-VAL
* Re: [RFD] Incremental fsck
2008-01-09 4:40 ` Al Boldi
2008-01-09 7:45 ` Valerie Henson
@ 2008-01-09 8:04 ` Valdis.Kletnieks
1 sibling, 0 replies; 39+ messages in thread
From: Valdis.Kletnieks @ 2008-01-09 8:04 UTC (permalink / raw)
To: Al Boldi; +Cc: Rik van Riel, linux-fsdevel, linux-kernel
On Wed, 09 Jan 2008 07:40:12 +0300, Al Boldi said:
> But why wouldn't it be possible to do this on the current fs infrastructure,
> using just a smart fsck, working incrementally on some sub-dir?
If you have /home/usera, /home/userb, and /home/userc, the vast majority of
fs screw-ups can't be detected by only looking at one sub-dir. For example,
you can't tell definitively that all blocks referenced by an inode under
/home/usera are properly only allocated to one file until you *also* look at
the inodes under user[bc]. Heck, you can't even tell if the link count for
a file is correct unless you walk the entire filesystem - you can find a file
with a link count of 3 in the inode, and you find one reference under usera,
and a second under userb - you can't tell if the count is one too high or
not until you walk through userc and actually see (or fail to see) a third
directory entry referencing it.
* Re: [RFD] Incremental fsck
2008-01-08 21:31 ` Alan
@ 2008-01-09 9:16 ` Andreas Dilger
2008-01-12 23:55 ` Daniel Phillips
0 siblings, 1 reply; 39+ messages in thread
From: Andreas Dilger @ 2008-01-09 9:16 UTC (permalink / raw)
To: Alan; +Cc: Al Boldi, linux-fsdevel, linux-kernel
Andi Kleen wrote:
>> Theodore Tso <tytso@mit.edu> writes:
>> > Now, there are good reasons for doing periodic checks every N mounts
>> > and after M months. And it has to do with PC class hardware. (Ted's
>> > aphorism: "PC class hardware is cr*p").
>>
>> If these reasons are good ones (some skepticism here) then the correct
>> way to really handle this would be to do regular background scrubbing
>> during runtime; ideally with metadata checksums so that you can actually
>> detect all corruption.
>>
>> But since fsck is so slow and disks are so big this whole thing
>> is a ticking time bomb now. e.g. it is not uncommon to require tens
>> of minutes or even hours of fsck time and some server that reboots
>> only every few months will eat that when it happens to reboot.
>> This means you get a quite long downtime.
>
> Has there been some thought about an incremental fsck?
While an _incremental_ fsck isn't so easy for existing filesystem types,
what is pretty easy to automate is making a read-only snapshot of a
filesystem via LVM/DM and then running e2fsck against that. The kernel
and filesystem have hooks to flush the changes from cache and make the
on-disk state consistent.
You can then set the ext[234] superblock mount count and last check
time via tune2fs if all is well, or schedule an outage if there are
inconsistencies found.
There is a copy of this script at:
http://osdir.com/ml/linux.lvm.devel/2003-04/msg00001.html
Note that it might need some tweaks to run with DM/LVM2 commands/output,
but is mostly what is needed.
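In rough Python, the workflow amounts to the following sketch (the
volume group, LV and snapshot names are made up for illustration; the
script linked above remains the reference):

import subprocess, sys

VG, LV, SNAP = "vg0", "home", "home-fsck-snap"   # illustrative names
dev  = "/dev/%s/%s" % (VG, LV)
snap = "/dev/%s/%s" % (VG, SNAP)

def run(*cmd):
    return subprocess.call(list(cmd))

# Point-in-time snapshot of the mounted filesystem; it only needs to be
# big enough to absorb the writes made while e2fsck runs.
run("lvcreate", "--snapshot", "--size", "1G", "--name", SNAP, dev)
try:
    # Check the snapshot; -n keeps e2fsck strictly read-only.
    rc = run("e2fsck", "-f", "-n", snap)
    if rc == 0:
        # Clean: reset mount count and last-check time on the real LV.
        run("tune2fs", "-C", "0", "-T", "now", dev)
    else:
        # Inconsistencies found: leave the superblock alone and
        # schedule an offline fsck instead.
        sys.stderr.write("e2fsck found problems on %s\n" % dev)
finally:
    run("lvremove", "-f", snap)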
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
* Re: [RFD] Incremental fsck
2008-01-09 7:45 ` Valerie Henson
@ 2008-01-09 11:52 ` Al Boldi
2008-01-09 14:44 ` Rik van Riel
2008-01-12 14:51 ` Theodore Tso
0 siblings, 2 replies; 39+ messages in thread
From: Al Boldi @ 2008-01-09 11:52 UTC (permalink / raw)
To: Valerie Henson; +Cc: Rik van Riel, linux-fsdevel, linux-kernel
Valerie Henson wrote:
> On Jan 8, 2008 8:40 PM, Al Boldi <a1426z@gawab.com> wrote:
> > Rik van Riel wrote:
> > > Al Boldi <a1426z@gawab.com> wrote:
> > > > Has there been some thought about an incremental fsck?
> > > >
> > > > You know, somehow fencing a sub-dir to do an online fsck?
> > >
> > > Search for "chunkfs"
> >
> > Sure, and there is TileFS too.
> >
> > But why wouldn't it be possible to do this on the current fs
> > infrastructure, using just a smart fsck, working incrementally on some
> > sub-dir?
>
> Several data structures are file system wide and require finding every
> allocated file and block to check that they are correct. In
> particular, block and inode bitmaps can't be checked per subdirectory.
Ok, but let's look at this a bit more opportunistic / optimistic.
Even after a black-out shutdown, the corruption is pretty minimal, using
ext3fs at least. So let's take advantage of this fact and do an optimistic
fsck, to assure integrity per-dir, and assume no external corruption. Then
we release this checked dir to the wild (optionally ro), and check the next.
Once we find external inconsistencies we either fix it unconditionally,
based on some preconfigured actions, or present the user with options.
All this could be per-dir or using some form of on-the-fly file-block-zoning.
And there probably is a lot more to it, but it should conceptually be
possible, with more thoughts though...
> http://infohost.nmt.edu/~val/review/chunkfs.pdf
Thanks!
--
Al
* Re: [RFD] Incremental fsck
2008-01-09 11:52 ` Al Boldi
@ 2008-01-09 14:44 ` Rik van Riel
2008-01-10 13:26 ` Al Boldi
2008-01-12 14:51 ` Theodore Tso
1 sibling, 1 reply; 39+ messages in thread
From: Rik van Riel @ 2008-01-09 14:44 UTC (permalink / raw)
To: Al Boldi; +Cc: Valerie Henson, linux-fsdevel, linux-kernel
On Wed, 9 Jan 2008 14:52:14 +0300
Al Boldi <a1426z@gawab.com> wrote:
> Ok, but let's look at this a bit more opportunistic / optimistic.
You can't play fast and loose with data integrity.
Besides, if we looked at things optimistically, we would conclude
that no fsck will be needed, ever :)
> > http://infohost.nmt.edu/~val/review/chunkfs.pdf
You will really want to read this paper, if you haven't already.
--
All Rights Reversed
* Re: [RFD] Incremental fsck
2008-01-09 14:44 ` Rik van Riel
@ 2008-01-10 13:26 ` Al Boldi
0 siblings, 0 replies; 39+ messages in thread
From: Al Boldi @ 2008-01-10 13:26 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-kernel
Rik van Riel wrote:
> Al Boldi <a1426z@gawab.com> wrote:
> > Ok, but let's look at this a bit more opportunistic / optimistic.
>
> You can't play fast and loose with data integrity.
Correct, but you have to be realistic...
> Besides, if we looked at things optimistically, we would conclude
> that no fsck will be needed,
And that's the reality, because people are mostly optimistic and feel
extremely tempted to just force-mount a dirty ext3fs, instead of waiting
hours on end for a complete fsck, which mostly comes back with some benign
"inode should be zero" warning.
> ever :)
Well not ever, but most people probably fsck during scheduled shutdowns, or
when they are forced to, due to online fs accessibility errors.
> > > http://infohost.nmt.edu/~val/review/chunkfs.pdf
>
> You will really want to read this paper, if you haven't already.
Definitely a good read, but attacking the problem from a completely different
POV.
BTW: Dropped some cc's due to bounces.
Thanks!
--
Al
* Re: [RFD] Incremental fsck
[not found] ` <9JHLl-2dL-1@gated-at.bofh.it>
@ 2008-01-11 14:20 ` Bodo Eggert
2008-01-12 10:20 ` Al Boldi
0 siblings, 1 reply; 39+ messages in thread
From: Bodo Eggert @ 2008-01-11 14:20 UTC (permalink / raw)
To: Al Boldi, Valerie Henson, Rik van Riel, linux-fsdevel,
linux-kernel
Al Boldi <a1426z@gawab.com> wrote:
> Even after a black-out shutdown, the corruption is pretty minimal, using
> ext3fs at least. So let's take advantage of this fact and do an optimistic
> fsck, to assure integrity per-dir, and assume no external corruption. Then
> we release this checked dir to the wild (optionally ro), and check the next.
> Once we find external inconsistencies we either fix it unconditionally,
> based on some preconfigured actions, or present the user with options.
Maybe we can know the changes that need to be done in order to fix the
filesystem. Let's record this information in - eh - let's call it a journal!
* Re: [RFD] Incremental fsck
2008-01-11 14:20 ` Bodo Eggert
@ 2008-01-12 10:20 ` Al Boldi
0 siblings, 0 replies; 39+ messages in thread
From: Al Boldi @ 2008-01-12 10:20 UTC (permalink / raw)
To: 7eggert; +Cc: linux-fsdevel, linux-kernel
Bodo Eggert wrote:
> Al Boldi <a1426z@gawab.com> wrote:
> > Even after a black-out shutdown, the corruption is pretty minimal, using
> > ext3fs at least. So let's take advantage of this fact and do an
> > optimistic fsck, to assure integrity per-dir, and assume no external
> > corruption. Then we release this checked dir to the wild (optionally
> > ro), and check the next. Once we find external inconsistencies we either
> > fix it unconditionally, based on some preconfigured actions, or present
> > the user with options.
>
> Maybe we can know the changes that need to be done in order to fix the
> filesystem. Let's record this information in - eh - let's call it a
> journal!
Don't mistake data=journal for an fsck replacement.
Thanks!
--
Al
* Re: [RFD] Incremental fsck
2008-01-09 11:52 ` Al Boldi
2008-01-09 14:44 ` Rik van Riel
@ 2008-01-12 14:51 ` Theodore Tso
2008-01-13 11:05 ` Al Boldi
` (2 more replies)
1 sibling, 3 replies; 39+ messages in thread
From: Theodore Tso @ 2008-01-12 14:51 UTC (permalink / raw)
To: Al Boldi; +Cc: Valerie Henson, Rik van Riel, linux-fsdevel, linux-kernel
On Wed, Jan 09, 2008 at 02:52:14PM +0300, Al Boldi wrote:
>
> Ok, but let's look at this a bit more opportunistic / optimistic.
>
> Even after a black-out shutdown, the corruption is pretty minimal, using
> ext3fs at least.
>
After an unclean shutdown, assuming you have decent hardware that
doesn't lie about when blocks hit iron oxide, you shouldn't have any
corruption at all. If you have crappy hardware, then all bets are off....
> So let's take advantage of this fact and do an optimistic fsck, to
> assure integrity per-dir, and assume no external corruption. Then
> we release this checked dir to the wild (optionally ro), and check
> the next. Once we find external inconsistencies we either fix it
> unconditionally, based on some preconfigured actions, or present the
> user with options.
So what can you check? The *only* thing you can check is whether or
not the directory syntax looks sane, whether the inode structure looks
sane, and whether or not the blocks reported as belonging to an inode
look sane.
What is very hard to check is whether or not the link count on the
inode is correct. Suppose the link count is 1, but there are actually
two directory entries pointing at it. Now when someone unlinks the
file through one of the directory entries, the link count will go
to zero, and the blocks will start to get reused, even though the
inode is still accessible via another pathname. Oops. Data Loss.
This is why doing incremental, on-line fsck'ing is *hard*. You're not
going to find this while doing each directory one at a time, and if
the filesystem is changing out from under you, it gets worse. And
it's not just the hard link count. There is a similar issue with the
block allocation bitmap. Detecting the case where two files
simultaneously claim the same block can't be done if you are doing it
incrementally, and if
the filesystem is changing out from under you, it's impossible, unless
you also have the filesystem telling you every single change while it
is happening, and you keep an insane amount of bookkeeping.
One thing that you *might* be able to do is to mount a filesystem readonly,
check it in the background while you allow users to access it
read-only. There are a few caveats, however ---- (1) some filesystem
errors may cause the data to be corrupt, or in the worst case, could
cause the system to panic (that would arguably be a
filesystem/kernel bug, but we've not necessarily done as much testing
here as we should.) (2) if there were any filesystem errors found,
you would need to completely unmount the filesystem to flush the inode
cache and remount it before it would be safe to remount the filesystem
read/write. You can't just do a "mount -o remount" if the filesystem
was modified under the OS's nose.
> All this could be per-dir or using some form of on-the-fly file-block-zoning.
>
> And there probably is a lot more to it, but it should conceptually be
> possible, with more thoughts though...
Many things are possible, in the NASA sense of "with enough thrust,
anything will fly". Whether or not it is *useful* and *worthwhile*
are of course different questions! :-)
- Ted
* Re: [RFD] Incremental fsck
2008-01-09 9:16 ` Andreas Dilger
@ 2008-01-12 23:55 ` Daniel Phillips
0 siblings, 0 replies; 39+ messages in thread
From: Daniel Phillips @ 2008-01-12 23:55 UTC (permalink / raw)
To: Andreas Dilger; +Cc: Alan, Al Boldi, linux-fsdevel, linux-kernel
On Wednesday 09 January 2008 01:16, Andreas Dilger wrote:
> While an _incremental_ fsck isn't so easy for existing filesystem
> types, what is pretty easy to automate is making a read-only snapshot
> of a filesystem via LVM/DM and then running e2fsck against that. The
> kernel and filesystem have hooks to flush the changes from cache and
> make the on-disk state consistent.
>
> You can then set the ext[234] superblock mount count and last
> check time via tune2fs if all is well, or schedule an outage if there
> are inconsistencies found.
>
> There is a copy of this script at:
> http://osdir.com/ml/linux.lvm.devel/2003-04/msg00001.html
>
> Note that it might need some tweaks to run with DM/LVM2
> commands/output, but is mostly what is needed.
You can do this now with ddsnap (an out-of-tree device mapper target)
either by checking a local snapshot or a replicated snapshot on a
different machine, see:
http://zumastor.org/
Doing the check on a remote machine seems attractive because the fsck
does not create a load on the server.
Regards,
Daniel
* Re: [RFD] Incremental fsck
2008-01-12 14:51 ` Theodore Tso
@ 2008-01-13 11:05 ` Al Boldi
2008-01-13 17:19 ` Pavel Machek
2008-01-14 0:22 ` Daniel Phillips
2 siblings, 0 replies; 39+ messages in thread
From: Al Boldi @ 2008-01-13 11:05 UTC (permalink / raw)
To: Theodore Tso; +Cc: linux-fsdevel, linux-kernel
Theodore Tso wrote:
> On Wed, Jan 09, 2008 at 02:52:14PM +0300, Al Boldi wrote:
> > Ok, but let's look at this a bit more opportunistic / optimistic.
> >
> > Even after a black-out shutdown, the corruption is pretty minimal, using
> > ext3fs at least.
>
> After an unclean shutdown, assuming you have decent hardware that
> doesn't lie about when blocks hit iron oxide, you shouldn't have any
> corruption at all. If you have crappy hardware, then all bets are off....
Maybe with barriers...
> > So let's take advantage of this fact and do an optimistic fsck, to
> > assure integrity per-dir, and assume no external corruption. Then
> > we release this checked dir to the wild (optionally ro), and check
> > the next. Once we find external inconsistencies we either fix it
> > unconditionally, based on some preconfigured actions, or present the
> > user with options.
>
> So what can you check? The *only* thing you can check is whether or
> not the directory syntax looks sane, whether the inode structure looks
> sane, and whether or not the blocks reported as belonging to an inode
> look sane.
Which would make this dir/area ready for read/write access.
> What is very hard to check is whether or not the link count on the
> inode is correct. Suppose the link count is 1, but there are actually
> two directory entries pointing at it. Now when someone unlinks the
> file through one of the directory entries, the link count will go
> to zero, and the blocks will start to get reused, even though the
> inode is still accessible via another pathname. Oops. Data Loss.
We could buffer this, and only actually overwrite when we are completely
finished with the fsck.
> This is why doing incremental, on-line fsck'ing is *hard*. You're not
> going to find this while doing each directory one at a time, and if
> the filesystem is changing out from under you, it gets worse. And
> it's not just the hard link count. There is a similar issue with the
> block allocation bitmap. Detecting the case where two files are
> simultaneously can't be done if you are doing it incrementally, and if
> the filesystem is changing out from under you, it's impossible, unless
> you also have the filesystem telling you every single change while it
> is happening, and you keep an insane amount of bookkeeping.
Ok, you have a point, so how about we change the implementation detail a bit,
from external fsck to internal fsck, leveraging the internal fs bookkeeping,
while allowing immediate but controlled read/write access.
Thanks for more thoughts!
--
Al
* Re: [RFD] Incremental fsck
2008-01-12 14:51 ` Theodore Tso
2008-01-13 11:05 ` Al Boldi
@ 2008-01-13 17:19 ` Pavel Machek
2008-01-13 17:41 ` Alan Cox
2008-01-15 1:04 ` [RFD] Incremental fsck Ric Wheeler
2008-01-14 0:22 ` Daniel Phillips
2 siblings, 2 replies; 39+ messages in thread
From: Pavel Machek @ 2008-01-13 17:19 UTC (permalink / raw)
To: Theodore Tso, Al Boldi, Valerie Henson, Rik van Riel,
linux-fsdevel, linux-kernel
On Sat 2008-01-12 09:51:40, Theodore Tso wrote:
> On Wed, Jan 09, 2008 at 02:52:14PM +0300, Al Boldi wrote:
> >
> > Ok, but let's look at this a bit more opportunistic / optimistic.
> >
> > Even after a black-out shutdown, the corruption is pretty minimal, using
> > ext3fs at least.
> >
>
> After an unclean shutdown, assuming you have decent hardware that
> doesn't lie about when blocks hit iron oxide, you shouldn't have any
> corruption at all. If you have crappy hardware, then all bets are off....
What hardware is crappy here? Let's say... internal hdd in thinkpad
x60?
What are ext3 expectations of disk (is there doc somewhere)? For
example... if disk does not lie, but powerfail during write damages
the sector -- is ext3 still going to work properly?
If disk does not lie, but powerfail during write may cause random
numbers to be returned on read -- can fsck handle that?
What about disk that kills 5 sectors around sector being written during
powerfail; can ext3 survive that?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: [RFD] Incremental fsck
2008-01-13 17:19 ` Pavel Machek
@ 2008-01-13 17:41 ` Alan Cox
2008-01-15 20:16 ` [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck) Pavel Machek
2008-01-15 1:04 ` [RFD] Incremental fsck Ric Wheeler
1 sibling, 1 reply; 39+ messages in thread
From: Alan Cox @ 2008-01-13 17:41 UTC (permalink / raw)
To: Pavel Machek
Cc: Theodore Tso, Al Boldi, Valerie Henson, Rik van Riel,
linux-fsdevel, linux-kernel
> What are ext3 expectations of disk (is there doc somewhere)? For
> example... if disk does not lie, but powerfail during write damages
> the sector -- is ext3 still going to work properly?
Nope. However the few disks that did this rapidly got firmware updates
because there are other OS's that can't cope.
> If disk does not lie, but powerfail during write may cause random
> numbers to be returned on read -- can fsck handle that?
most of the time. and fsck knows about writing sectors to remove read
errors in metadata blocks.
> What about disk that kills 5 sectors around sector being written during
> powerfail; can ext3 survive that?
generally. Note btw that for added fun there is nothing that guarantees
the blocks around a block on the media are sequentially numbered. They
usually are but you never know.
Alan
* Re: [RFD] Incremental fsck
2008-01-12 14:51 ` Theodore Tso
2008-01-13 11:05 ` Al Boldi
2008-01-13 17:19 ` Pavel Machek
@ 2008-01-14 0:22 ` Daniel Phillips
2 siblings, 0 replies; 39+ messages in thread
From: Daniel Phillips @ 2008-01-14 0:22 UTC (permalink / raw)
To: Theodore Tso
Cc: Al Boldi, Valerie Henson, Rik van Riel, linux-fsdevel,
linux-kernel
Hi Ted,
On Saturday 12 January 2008 06:51, Theodore Tso wrote:
> What is very hard to check is whether or not the link count on the
> inode is correct. Suppose the link count is 1, but there are
> actually two directory entries pointing at it. Now when someone
> unlinks the file through one of the directory entries, the link
> count will go to zero, and the blocks will start to get reused, even
> though the inode is still accessible via another pathname. Oops.
> Data Loss.
>
> This is why doing incremental, on-line fsck'ing is *hard*. You're
> not going to find this while doing each directory one at a time, and
> if the filesystem is changing out from under you, it gets worse. And
> it's not just the hard link count. There is a similar issue with the
> block allocation bitmap. Detecting the case where two files
> simultaneously claim the same block can't be done if you are doing it
> incrementally, and
> if the filesystem is changing out from under you, it's impossible,
> unless you also have the filesystem telling you every single change
> while it is happening, and you keep an insane amount of bookkeeping.
In this case I am listening to Chicken Little carefully and really do
believe the sky will fall if we fail to come up with an incremental
online fsck some time in the next few years. I realize the challenge
verges on insane, but I have been slowly chewing away at this question
for some time.
Val proposes to simplify the problem by restricting the scope of block
pointers and hard links. Best of luck with that, the concept of fault
isolation domains has a nice ring to it. I prefer to stick close to
tried and true Ext3 and not change the basic algorithms.
Rather than restricting pointers, I propose to add a small amount of new
metadata to accelerate global checking. The idea is to be able to
build per-group reverse maps very quickly, to support mapping physical
blocks back to inodes that own them, and mapping inodes back to the
directories that reference them.
I see on-the-fly filesystem reverse mapping as useful for more than just
online fsck. For example it would be nice to be able to work backwards
efficiently from a list of changed blocks such as ddsnap produces to a
list of file level changes.
The amount of metadata required to support efficient on-the-fly reverse
mapping is surprisingly small: 2K per block group per terabyte, in a
fixed location at the base of each group. This is consistent with my
goal of producing code that is mergable for Ext4 and backportable to
Ext3.
Building a block reverse map for a given group is easy and efficient.
The first pass walks across the inode table and already maps most of
the physical blocks for typical usage patterns, because most files only
have direct pointers. Index blocks discovered in the first pass go
onto a list to be processed by subsequent passes, which may discover
additional index blocks. Just keep pushing the index blocks back onto
the list and the algorithm terminates when the list is empty. This
builds a reverse map for the group including references to external
groups.
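A sketch of that group mapping pass in Python, where inodes_in_group,
direct_blocks, index_blocks and blocks_in_index are hypothetical
helpers standing in for real ext3 on-disk parsing:

def build_block_reverse_map(group):
    rmap = {}       # physical block number -> owning inode number
    pending = []    # index blocks discovered but not yet walked

    # First pass: the inode table alone maps most blocks, since most
    # files have only direct pointers.
    for inode in inodes_in_group(group):
        for blk in direct_blocks(inode):
            rmap[blk] = inode.number
        pending.extend((inode.number, idx) for idx in index_blocks(inode))

    # Later passes: walk index blocks, which may turn up further index
    # blocks; terminate when the pending list is empty.
    while pending:
        owner, idx = pending.pop()
        rmap[idx] = owner
        for blk, is_index in blocks_in_index(idx):
            if is_index:
                pending.append((owner, blk))
            else:
                rmap[blk] = owner
    return rmap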
Note that the recent metadata clustering patch from Abhishek Rai will
speed up this group mapping algorithm significantly because (almost)
all the index blocks can be picked up in one linear read. This should
only take a few milliseconds. One more reason why I think his patch is
an Important Patch[tm].
A data block may be up to four groups removed from its home group,
therefore the reverse mapping process must follow pointers across
groups and map each file entirely to be sure that all pointers to the
group being checked have been discovered. It is possible to construct
a case where a group contains a lot of inodes of big files that are
mostly stored in other groups. Mapping such a group could possibly
require examining all the index blocks on the entire volume. That
would be about 2**18 index blocks per terabyte, which is still within
the realm of practicality.
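Back-of-envelope for that 2**18 figure, assuming 4 KiB blocks and
4-byte block pointers (and ignoring the handful of double and triple
indirect blocks):

blocks_per_tib = (1 << 40) // 4096      # 2**28 data blocks per terabyte
ptrs_per_index = 4096 // 4              # 1024 pointers per index block
index_blocks   = blocks_per_tib // ptrs_per_index   # 2**18, as quoted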
To generate the inode reverse map for a group, walk each directory in the group,
decoding the index blocks by hand. Strictly speaking, directories
ought to pass block level checking before being reverse mapped, but
there could be many directories in the same group spilling over into a
lot of external groups, so getting all the directory inodes to pass
block level checks at the same time could be difficult with filesystem
writing going on between fsck episodes. Instead, just go ahead and
assume a directory file is ok, and if this is not the case the
directory walk will fail or a block level check will eventually pick up
the problem.
The worst case for directory mapping is much worse than the worst case
for block mapping. A single directory could fill an entire volume.
For such a large directory, reverse mapping is not possible without
keeping the filesystem suspended for an unreasonable time. Either make
the reverse map incremental and maintained on the fly or fall back to a
linear search of the entire directory when doing the checks below, the
latter being easy but very slow. Or just give up on fscking groups
involving the directory. Or maybe I am obsessing about this too much,
because mapping a directory of a million files only requires reading
about 60 MB, and such large directories are very rare.
The group cross reference tables have to be persistently recorded on
disk in order to avoid searching the whole volume for some checks. A
per group bitmap handles this nicely, with as many bits as there are
block groups. Each one bit flags some external group as referencing
the group in which the bitmap is stored. With default settings, a
cross reference bitmap is only 1K per terabyte. Two such bitmaps are
needed per group, one for external block pointers and the other for
external hard links. When needed for processing, a bitmap is converted
into a list or hash table. New cross group references need to be
detected in the filesystem code and saved to disk before the associated
transaction proceeds. Though new cross group references should be
relatively rare, the cross reference bitmaps can be logically
journalled in the commit block of the associated transaction and
batch-updated on journal flush so that there is very little new write
overhead.
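For scale, assuming the usual 32768 blocks (128 MiB) per group with
4 KiB blocks:

groups_per_tib = (1 << 40) // (128 << 20)   # 8192 block groups per terabyte
bitmap_bytes   = groups_per_tib // 8        # one bit per group -> 1 KiB
both_bitmaps   = 2 * bitmap_bytes           # block-pointer + hard-link maps -> 2 KiB

which is where the 2K per group per terabyte quoted earlier comes from.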
Cross group references may be updated lazily on delete, since the only
harm caused by false positives is extra time spent building unneeded
reverse maps. The incremental fsck step "check group cross reference
bitmaps" describes how redundant cross reference bits are detected.
These can be batched up in memory and updated as convenient.
Cached reverse maps can be disturbed by write activity on the
filesystem. The lazy approach is to discard them on any change, which
should work well enough to get started. With additional work, cached
reverse maps can be updated on the fly by the respective get_block,
truncate and directory entry operations.
The incremental fsck algorithm works by checking a volume one block
group at a time, with filesystem operations suspended during the check.
The expected service interruption will be small compared to taking the
volume offline but will occur more often, which might be an issue for
interactive use. Making the algorithm completely bumpless would
require something like a temporary in-memory volume snapshot, followed
by a clever merge of the changed blocks, taking advantage of the
reverse maps to know which checked groups need to be changed back to
unchecked. Beyond the scope of the current effort.
Broadly, there are two layers of integrity to worry about:
1) Block pointers (block level)
2) Directory entries (inode level)
Luckily for lazy programmers, similar techniques and even identical data
structures work for both. Each group has one persistent bitmap to
reference the group via block pointers, and another to show which
groups reference the group via directory entries. In memory, there is
one cached reverse map per group to map blocks to inodes (one to one),
and another to map inodes to directory inodes (one to many).
Algorithms for block and inode level checks are similar, as detailed
below.
With on-demand reverse maps to help us, we do something like:
* Suspend filesystem, flushing dirty page cache to disk
* Build reverse map for this group if needed
* Build reverse maps for groups referencing this group as needed
* Perform checks listed below
* If the checks passed mark the group as checked
* Resume filesystem
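In Python-flavoured pseudocode, with the suspend/resume hooks and the
check and cross-reference helpers left hypothetical, the driver for the
steps above is roughly:

def check_group(fs, group, rmap_cache):
    fs.suspend()        # flushes dirty page cache as a side effect
    try:
        # Build (or reuse) the reverse map for this group and for every
        # group whose cross-reference bitmap points at it.
        for g in [group] + list(groups_referencing(group)):
            if g not in rmap_cache:
                rmap_cache[g] = build_block_reverse_map(g)
        external = [rmap_cache[g] for g in groups_referencing(group)]
        ok = run_group_checks(group, rmap_cache[group], external)
        if ok:
            mark_group_checked(group)   # record time checked successfully
        return ok
    finally:
        fs.resume()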
The order in which groups are checked can depend opportunistically on
which reverse maps are already cached. Some kind of userspace
interface would let the operator know about checking progress and the
nature of problems detected. Each group records the time last checked,
the time checked successfully, and a few bits indicating the nature of
problems found.
Now for the specific integrity checks, and strategy for each. There are
two interesting kinds of potentially nonlocal checks:
Downward check: this group may reference other groups that must be
examined together with this group to complete the check. Can be
completed immediately when all references are local, otherwise
make a list of groups needed to continue the check and defer
until convenient to map those groups all at the same time.
Upward check: this group may be referenced by other groups that must
be examined together with this group in order to complete the
check. Need to check the maps of all groups referencing this
group to find incoming references.
To prepare for checking a group, its reverse map is constructed if not
already cached, and the reverse maps for all groups marked as
referencing the group. If that is too many reverse maps then just give
up on trying to fsck that group, or do it very slowly by constructing
each reverse map at the point it is actually needed in a check.
As online fsck works its way through groups a list of pending downward
checks for certain inodes will build up. When this list gets long
enough, find a subset of it involving a reasonably small number of
groups, map those groups and perform the needed checks.
A list of checks and correspondence to e2fsck passes follows.
Inode mode field size and block count (e2fsck pass 1)
Downward check. Do the local inode checks, then walk the inode index
structure counting the blocks. Ensure that the rightmost block lies
within the inode size.
Check block references (e2fsck pass 1)
Upward check. Check that each block found to be locally referenced is
not marked free in the block bitmap. For each block found to have no
local reference, check the maps of the groups referencing this group to
ensure that exactly one of them points at the block, or none if the
block is marked free in the group bitmap.
Check directory structure (e2fsck pass 2)
Downward check. The same set of directory structure tests as e2fsck,
such as properly formed directory entries, htree nodes, etc.
Check directory inode links (e2fsck pass 3)
Upward check. While walking directory entries, ensure that each
directory inode to be added to the reverse map is not already in the
map and is not marked free in the inode bitmap. For each inode
discovered to have no local link after building the reverse map, check
the reverse maps of the groups referring to this group to ensure that
exactly one of them links to the inode, or that there are no external
links if the block bitmap indicates the block is free.
Check inode reference counts (e2fsck pass 4)
Upward check. While walking directory entries, ensure that each non
directory inode to be added to the reverse map is not marked free in
the inode bitmap. Check that the inode reference count is equal to the
number of references to the inode found in the local reverse map plus
the number of references found in the maps of all groups referencing
this group.
Check block bitmaps (e2fsck pass 5)
Checking block references above includes ensuring that no block in use
is marked free. Now check that no block marked free in the block
bitmap appears in the local or external block reverse map.
Check inode bitmaps (e2fsck pass 5)
Checking inode references above includes ensuring that no inode in use
is marked free. Now check that no inode marked free in the inode
bitmap appears in the local or external inode reverse map.
Check group cross reference bitmaps
Each time a group is mapped, check that for each external reference
discovered the corresponding bit is set in the external bitmap. Check
that for all groups having an external reference bit set for this
group, this group does in fact reference the external group. Because
cross reference bitmaps are so small they should all fit in cache
comfortably. The buffer cache is ideal for this.
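To make the upward checks concrete, the pass 4 count check might look
roughly like this, given hypothetical inode reverse maps that map an
inode number to the list of directory inodes referencing it:

def check_inode_refcount(inode, local_rmap, external_rmaps):
    # References found in this group's inode reverse map...
    found = len(local_rmap.get(inode.number, []))
    # ...plus references found in the maps of all groups whose
    # hard-link cross-reference bit points at this group.
    found += sum(len(r.get(inode.number, [])) for r in external_rmaps)
    return found == inode.links_count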
Finally...
Total work required to get something working along these lines looks
significant, but the importance is high, so work is quite likely to go
ahead if the approach survives scrutiny.
Regards,
Daniel
* Re: [RFD] Incremental fsck
2008-01-13 17:19 ` Pavel Machek
2008-01-13 17:41 ` Alan Cox
@ 2008-01-15 1:04 ` Ric Wheeler
1 sibling, 0 replies; 39+ messages in thread
From: Ric Wheeler @ 2008-01-15 1:04 UTC (permalink / raw)
To: Pavel Machek
Cc: Theodore Tso, Al Boldi, Valerie Henson, Rik van Riel,
linux-fsdevel, linux-kernel
Pavel Machek wrote:
> On Sat 2008-01-12 09:51:40, Theodore Tso wrote:
>> On Wed, Jan 09, 2008 at 02:52:14PM +0300, Al Boldi wrote:
>>> Ok, but let's look at this a bit more opportunistic / optimistic.
>>>
>>> Even after a black-out shutdown, the corruption is pretty minimal, using
>>> ext3fs at least.
>>>
>> After an unclean shutdown, assuming you have decent hardware that
>> doesn't lie about when blocks hit iron oxide, you shouldn't have any
>> corruption at all. If you have crappy hardware, then all bets are off....
>
> What hardware is crappy here? Let's say... internal hdd in thinkpad
> x60?
>
> What are ext3 expectations of disk (is there doc somewhere)? For
> example... if disk does not lie, but powerfail during write damages
> the sector -- is ext3 still going to work properly?
>
> If disk does not lie, but powerfail during write may cause random
> numbers to be returned on read -- can fsck handle that?
>
> What about disk that kills 5 sectors around sector being written during
> powerfail; can ext3 survive that?
>
> Pavel
>
I think that you have to keep in mind the way disks (and other media)
fail. You can get media failures after a successful write or errors that
pop up as the media ages.
Not to mention the way most people run with write cache enabled and no
write barriers enabled - a sure recipe for corruption.
Of course, there are always software errors to introduce corruption even
when we get everything else right ;-)
From what I see, media errors are the number one cause of corruption in
file systems. It is critical that fsck (and any other tools) continue
after an IO error since they are fairly common (just assume that sector
is lost and do your best as you continue on).
ric
* [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
2008-01-13 17:41 ` Alan Cox
@ 2008-01-15 20:16 ` Pavel Machek
2008-01-15 21:43 ` David Chinner
2008-01-16 1:44 ` Daniel Phillips
0 siblings, 2 replies; 39+ messages in thread
From: Pavel Machek @ 2008-01-15 20:16 UTC (permalink / raw)
To: Alan Cox
Cc: Theodore Tso, Al Boldi, Valerie Henson, Rik van Riel,
linux-fsdevel, linux-kernel
Hi!
> > What are ext3 expectations of disk (is there doc somewhere)? For
> > example... if disk does not lie, but powerfail during write damages
> > the sector -- is ext3 still going to work properly?
>
> Nope. However the few disks that did this rapidly got firmware updates
> because there are other OS's that can't cope.
>
> > If disk does not lie, but powerfail during write may cause random
> > numbers to be returned on read -- can fsck handle that?
>
> most of the time. and fsck knows about writing sectors to remove read
> errors in metadata blocks.
>
> > What about disk that kills 5 sectors around sector being written during
> > powerfail; can ext3 survive that?
>
> generally. Note btw that for added fun there is nothing that guarantees
> > the blocks around a block on the media are sequentially numbered. They
> usually are but you never know.
Ok, should something like this be added to the documentation?
It would be cool to be able to include a few examples (modern SATA disks
support barriers so are safe, any IDE from 1989 is unsafe), but I do
not know enough about hw...
Signed-off-by: Pavel Machek <pavel@suse.cz>
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index b45f3c1..adfcc9d 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -183,6 +183,18 @@ mke2fs: create a ext3 partition with th
debugfs: ext2 and ext3 file system debugger.
ext2online: online (mounted) ext2 and ext3 filesystem resizer
+Requirements
+============
+
+Ext3 needs a disk that does not do write-back caching, or a disk that
+supports barriers and a Linux configuration that can use them.
+
+* if the disk damages the sector being written during powerfail, ext3
+ can't cope with that. Fortunately, such disks got firmware updates
+ to fix this a long time ago.
+
+* if the disk writes random data during powerfail, ext3 should survive
+ that most of the time.
References
==========
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
2008-01-15 20:16 ` [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck) Pavel Machek
@ 2008-01-15 21:43 ` David Chinner
2008-01-15 23:07 ` Pavel Machek
2008-01-16 16:38 ` Christoph Hellwig
2008-01-16 1:44 ` Daniel Phillips
1 sibling, 2 replies; 39+ messages in thread
From: David Chinner @ 2008-01-15 21:43 UTC (permalink / raw)
To: Pavel Machek
Cc: Alan Cox, Theodore Tso, Al Boldi, Valerie Henson, Rik van Riel,
linux-fsdevel, linux-kernel
On Tue, Jan 15, 2008 at 09:16:53PM +0100, Pavel Machek wrote:
> Hi!
>
> > > What are ext3 expectations of disk (is there doc somewhere)? For
> > > example... if disk does not lie, but powerfail during write damages
> > > the sector -- is ext3 still going to work properly?
> >
> > Nope. However the few disks that did this rapidly got firmware updates
> > because there are other OS's that can't cope.
> >
> > > If disk does not lie, but powerfail during write may cause random
> > > numbers to be returned on read -- can fsck handle that?
> >
> > most of the time. and fsck knows about writing sectors to remove read
> > errors in metadata blocks.
> >
> > > What about disk that kills 5 sectors around sector being written during
> > > powerfail; can ext3 survive that?
> >
> > generally. Note btw that for added fun there is nothing that guarantees
> > > the blocks around a block on the media are sequentially numbered. They
> > usually are but you never know.
>
> Ok, should something like this be added to the documentation?
>
> It would be cool to be able to include a few examples (modern SATA disks
> support barriers so are safe, any IDE from 1989 is unsafe), but I do
> not know enough about hw...
ext3 is not the only filesystem that will have trouble due to
volatile write caches. We see problems often enough with XFS
due to volatile write caches that it's in our FAQ:
http://oss.sgi.com/projects/xfs/faq.html#wcache
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
2008-01-15 21:43 ` David Chinner
@ 2008-01-15 23:07 ` Pavel Machek
2008-01-15 23:44 ` Daniel Phillips
2008-01-16 16:38 ` Christoph Hellwig
1 sibling, 1 reply; 39+ messages in thread
From: Pavel Machek @ 2008-01-15 23:07 UTC (permalink / raw)
To: David Chinner
Cc: Alan Cox, Theodore Tso, Al Boldi, Valerie Henson, Rik van Riel,
linux-fsdevel, linux-kernel
Hi!
> > > > What are ext3 expectations of disk (is there doc somewhere)? For
> > > > example... if disk does not lie, but powerfail during write damages
> > > > the sector -- is ext3 still going to work properly?
> > >
> > > Nope. However the few disks that did this rapidly got firmware updates
> > > because there are other OS's that can't cope.
> > >
> > > > If disk does not lie, but powerfail during write may cause random
> > > > numbers to be returned on read -- can fsck handle that?
> > >
> > > most of the time. and fsck knows about writing sectors to remove read
> > > errors in metadata blocks.
> > >
> > > > What about disk that kills 5 sectors around sector being written during
> > > > powerfail; can ext3 survive that?
> > >
> > > generally. Note btw that for added fun there is nothing that guarantees
> > > > the blocks around a block on the media are sequentially numbered. They
> > > usually are but you never know.
> >
> > Ok, should something like this be added to the documentation?
> >
> > It would be cool to be able to include a few examples (modern SATA disks
> > support barriers so are safe, any IDE from 1989 is unsafe), but I do
> > not know enough about hw...
>
> ext3 is not the only filesystem that will have trouble due to
> volatile write caches. We see problems often enough with XFS
> due to volatile write caches that it's in our FAQ:
>
> http://oss.sgi.com/projects/xfs/faq.html#wcache
Nice FAQ, yep. Perhaps you should move parts of it to Documentation/,
and I could then make the ext3 FAQ point to it?
I had write cache enabled on my main computer. Oops. I guess that
means we do need better documentation.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
2008-01-15 23:07 ` Pavel Machek
@ 2008-01-15 23:44 ` Daniel Phillips
2008-01-16 0:15 ` Alan Cox
2008-01-16 11:51 ` Pavel Machek
0 siblings, 2 replies; 39+ messages in thread
From: Daniel Phillips @ 2008-01-15 23:44 UTC (permalink / raw)
To: Pavel Machek
Cc: David Chinner, Alan Cox, Theodore Tso, Al Boldi, Valerie Henson,
Rik van Riel, linux-fsdevel, linux-kernel
On Jan 15, 2008 6:07 PM, Pavel Machek <pavel@ucw.cz> wrote:
> I had write cache enabled on my main computer. Oops. I guess that
> means we do need better documentation.
Writeback cache on disk in itself is not bad, it only gets bad if the
disk is not engineered to save all its dirty cache on power loss,
using the disk motor as a generator or alternatively a small battery.
It would be awfully nice to know which brands fail here, if any,
because writeback cache is a big performance booster.
Regards,
Daniel
* Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
2008-01-15 23:44 ` Daniel Phillips
@ 2008-01-16 0:15 ` Alan Cox
2008-01-16 1:24 ` Daniel Phillips
2008-01-16 21:28 ` Eric Sandeen
2008-01-16 11:51 ` Pavel Machek
1 sibling, 2 replies; 39+ messages in thread
From: Alan Cox @ 2008-01-16 0:15 UTC (permalink / raw)
To: Daniel Phillips
Cc: Pavel Machek, David Chinner, Theodore Tso, Al Boldi,
Valerie Henson, Rik van Riel, linux-fsdevel, linux-kernel
> Writeback cache on disk in itself is not bad, it only gets bad if the
> disk is not engineered to save all its dirty cache on power loss,
> using the disk motor as a generator or alternatively a small battery.
> It would be awfully nice to know which brands fail here, if any,
> because writeback cache is a big performance booster.
AFAIK no drive saves the cache. The worst case cache flush for drives is
several seconds with no retries and a couple of minutes if something
really bad happens.
This is why the kernel has some knowledge of barriers and uses them to
issue flushes when needed.
* Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
2008-01-16 0:15 ` Alan Cox
@ 2008-01-16 1:24 ` Daniel Phillips
2008-01-16 1:36 ` Chris Mason
2008-01-16 21:28 ` Eric Sandeen
1 sibling, 1 reply; 39+ messages in thread
From: Daniel Phillips @ 2008-01-16 1:24 UTC (permalink / raw)
To: Alan Cox
Cc: Pavel Machek, David Chinner, Theodore Tso, Al Boldi,
Valerie Henson, Rik van Riel, linux-fsdevel, linux-kernel
On Jan 15, 2008 7:15 PM, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> > > Writeback cache on disk in itself is not bad, it only gets bad if the
> > disk is not engineered to save all its dirty cache on power loss,
> > using the disk motor as a generator or alternatively a small battery.
> > It would be awfully nice to know which brands fail here, if any,
> > because writeback cache is a big performance booster.
>
> AFAIK no drive saves the cache. The worst case cache flush for drives is
> several seconds with no retries and a couple of minutes if something
> really bad happens.
>
> This is why the kernel has some knowledge of barriers and uses them to
> issue flushes when needed.
Indeed, you are right, which is supported by actual measurements:
http://sr5tech.com/write_back_cache_experiments.htm
Sorry for implying that anybody has engineered a drive that can do
such a nice thing with writeback cache.
The "disk motor as a generator" tale may not be purely folklore. When
an IDE drive is not in writeback mode, something special needs to be done
to ensure the last write to media is not a scribble.
A small UPS can make writeback mode actually reliable, provided the
system is smart enough to take the drives out of writeback mode when
the line power is off.
Regards,
Daniel
* Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
2008-01-16 1:24 ` Daniel Phillips
@ 2008-01-16 1:36 ` Chris Mason
2008-01-17 20:54 ` Pavel Machek
0 siblings, 1 reply; 39+ messages in thread
From: Chris Mason @ 2008-01-16 1:36 UTC (permalink / raw)
To: Daniel Phillips
Cc: Alan Cox, Pavel Machek, David Chinner, Theodore Tso, Al Boldi,
Valerie Henson, Rik van Riel, linux-fsdevel, linux-kernel
On Tue, 15 Jan 2008 20:24:27 -0500
"Daniel Phillips" <phillips@google.com> wrote:
> On Jan 15, 2008 7:15 PM, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> > > Writeback cache on disk in itself is not bad, it only gets bad
> > > if the disk is not engineered to save all its dirty cache on
> > > power loss, using the disk motor as a generator or alternatively
> > > a small battery. It would be awfully nice to know which brands
> > > fail here, if any, because writeback cache is a big performance
> > > booster.
> >
> > AFAIK no drive saves the cache. The worst case cache flush for
> > drives is several seconds with no retries and a couple of minutes
> > if something really bad happens.
> >
> > This is why the kernel has some knowledge of barriers and uses them
> > to issue flushes when needed.
>
> Indeed, you are right, which is supported by actual measurements:
>
> http://sr5tech.com/write_back_cache_experiments.htm
>
> Sorry for implying that anybody has engineered a drive that can do
> such a nice thing with writeback cache.
>
> The "disk motor as a generator" tale may not be purely folklore. When
> an IDE drive is not in writeback mode, something special needs to done
> to ensure the last write to media is not a scribble.
>
> A small UPS can make writeback mode actually reliable, provided the
> system is smart enough to take the drives out of writeback mode when
> the line power is off.
We've had mount -o barrier=1 for ext3 for a while now, it makes
writeback caching safe. XFS has this on by default, as does reiserfs.
-chris
* Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
2008-01-15 20:16 ` [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck) Pavel Machek
2008-01-15 21:43 ` David Chinner
@ 2008-01-16 1:44 ` Daniel Phillips
2008-01-16 3:05 ` Rik van Riel
` (2 more replies)
1 sibling, 3 replies; 39+ messages in thread
From: Daniel Phillips @ 2008-01-16 1:44 UTC (permalink / raw)
To: Pavel Machek
Cc: Alan Cox, Theodore Tso, Al Boldi, Valerie Henson, Rik van Riel,
linux-fsdevel, linux-kernel
Hi Pavel,
Along with this effort, could you let me know if the world actually
cares about online fsck? Now we know how to do it I think, but is it
worth the effort?
Regards,
Daniel
* Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
2008-01-16 1:44 ` Daniel Phillips
@ 2008-01-16 3:05 ` Rik van Riel
2008-01-17 7:38 ` Andreas Dilger
2008-01-16 11:49 ` Pavel Machek
2008-01-17 12:29 ` Szabolcs Szakacsits
2 siblings, 1 reply; 39+ messages in thread
From: Rik van Riel @ 2008-01-16 3:05 UTC (permalink / raw)
To: Daniel Phillips
Cc: Pavel Machek, Alan Cox, Theodore Tso, Al Boldi, Valerie Henson,
linux-fsdevel, linux-kernel
On Tue, 15 Jan 2008 20:44:38 -0500
"Daniel Phillips" <phillips@google.com> wrote:
> Along with this effort, could you let me know if the world actually
> cares about online fsck? Now we know how to do it I think, but is it
> worth the effort.
With a filesystem that is compartmentalized and checksums metadata,
I believe that an online fsck is absolutely worth having.
Instead of the filesystem resorting to mounting the whole volume
read-only on certain errors, part of the filesystem can be offlined
while an fsck runs. This could even be done automatically in many
situations.
--
All rights reversed.
* Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
2008-01-16 1:44 ` Daniel Phillips
2008-01-16 3:05 ` Rik van Riel
@ 2008-01-16 11:49 ` Pavel Machek
2008-01-16 20:52 ` Valerie Henson
2008-01-17 12:29 ` Szabolcs Szakacsits
2 siblings, 1 reply; 39+ messages in thread
From: Pavel Machek @ 2008-01-16 11:49 UTC (permalink / raw)
To: Daniel Phillips
Cc: Alan Cox, Theodore Tso, Al Boldi, Valerie Henson, Rik van Riel,
linux-fsdevel, linux-kernel
Hi!
> Along with this effort, could you let me know if the world actually
> cares about online fsck?
I'm not the world's spokeperson (yet ;-).
> Now we know how to do it I think, but is it
> worth the effort.
ext3's "lets fsck on every 20 mounts" is good idea, but it can be
annoying when developing. Having option to fsck while filesystem is
online takes that annoyance away.
So yes, it would be very useful for me...
For long-running servers, this may be less of a problem... but OTOH
their filesystems are not checked at all as long as the servers are
online... so online fsck is actually important there, too, but for
other reasons.
So yes, it is very useful for the world.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
2008-01-15 23:44 ` Daniel Phillips
2008-01-16 0:15 ` Alan Cox
@ 2008-01-16 11:51 ` Pavel Machek
2008-01-16 12:20 ` Valdis.Kletnieks
1 sibling, 1 reply; 39+ messages in thread
From: Pavel Machek @ 2008-01-16 11:51 UTC (permalink / raw)
To: Daniel Phillips
Cc: David Chinner, Alan Cox, Theodore Tso, Al Boldi, Valerie Henson,
Rik van Riel, linux-fsdevel, linux-kernel
On Tue 2008-01-15 18:44:26, Daniel Phillips wrote:
> On Jan 15, 2008 6:07 PM, Pavel Machek <pavel@ucw.cz> wrote:
> > I had write cache enabled on my main computer. Oops. I guess that
> > means we do need better documentation.
>
> Writeback cache on disk in itself is not bad; it only gets bad if the
> disk is not engineered to save all its dirty cache on power loss,
> using the disk motor as a generator or alternatively a small battery.
> It would be awfully nice to know which brands fail here, if any,
> because writeback cache is a big performance booster.
Is it?
I guess I should try to measure it. (Linux already does writeback
caching, with 2GB of memory. I wonder how important the disk's 2MB of
cache can be).
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
2008-01-16 11:51 ` Pavel Machek
@ 2008-01-16 12:20 ` Valdis.Kletnieks
2008-01-19 14:51 ` Pavel Machek
0 siblings, 1 reply; 39+ messages in thread
From: Valdis.Kletnieks @ 2008-01-16 12:20 UTC (permalink / raw)
To: Pavel Machek
Cc: Daniel Phillips, David Chinner, Alan Cox, Theodore Tso, Al Boldi,
Valerie Henson, Rik van Riel, linux-fsdevel, linux-kernel
On Wed, 16 Jan 2008 12:51:44 +0100, Pavel Machek said:
> I guess I should try to measure it. (Linux already does writeback
> caching, with 2GB of memory. I wonder how important the disk's 2MB of
> cache can be).
It serves essentially the same purpose as the 'async' option in /etc/exports
(i.e. we declare it "done" when the other end of the wire says it's caught
the data, not when it's actually committed), with similar latency wins. Of
course, it's impedance-matching for bursty traffic - the 2M doesn't do much
at all if you're streaming data to it. For what it's worth, the 80G Seagate
drive in my laptop claims it has 8M, so it probably does 4 times as much
good as 2M. ;)
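(For anyone who hasn't bumped into that option, here is a minimal
/etc/exports illustration of the sync/async distinction -- the path and
hostname are made up, and you would pick one line or the other, not both:)
# acknowledge writes only after they reach stable storage:
/srv/data    client.example.com(rw,sync)
# acknowledge writes as soon as the server has them in memory (faster, riskier):
/srv/data    client.example.com(rw,async)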
* Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
2008-01-15 21:43 ` David Chinner
2008-01-15 23:07 ` Pavel Machek
@ 2008-01-16 16:38 ` Christoph Hellwig
1 sibling, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2008-01-16 16:38 UTC (permalink / raw)
To: David Chinner
Cc: Pavel Machek, Alan Cox, Theodore Tso, Al Boldi, Valerie Henson,
Rik van Riel, linux-fsdevel, linux-kernel
On Wed, Jan 16, 2008 at 08:43:25AM +1100, David Chinner wrote:
> ext3 is not the only filesystem that will have trouble due to
> volatile write caches. We see problems often enough with XFS
> due to volatile write caches that it's in our FAQ:
In fact it will hit every filesystem. A write-back cache that can't
be forced to write back by the filesystem will cause corruption on
uncontained power loss, period.
* Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
2008-01-16 11:49 ` Pavel Machek
@ 2008-01-16 20:52 ` Valerie Henson
0 siblings, 0 replies; 39+ messages in thread
From: Valerie Henson @ 2008-01-16 20:52 UTC (permalink / raw)
To: Pavel Machek
Cc: Daniel Phillips, Alan Cox, Theodore Tso, Al Boldi, Rik van Riel,
linux-fsdevel, linux-kernel
On Jan 16, 2008 3:49 AM, Pavel Machek <pavel@ucw.cz> wrote:
>
> ext3's "lets fsck on every 20 mounts" is good idea, but it can be
> annoying when developing. Having option to fsck while filesystem is
> online takes that annoyance away.
I'm sure everyone on cc: knows this, but for the record, you can change
ext3's fsck-every-N-mounts or every-N-days setting to something that
makes sense for your use case. Usually I just turn it off entirely and
run fsck by hand when I'm worried:
# tune2fs -c 0 -i 0 /dev/whatever
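(To see what a filesystem is currently set to before changing anything --
the device name is just a placeholder, as above:)
# tune2fs -l /dev/whatever | grep -i -e 'mount count' -e 'check interval'
The "Maximum mount count" and "Check interval" fields are what -c and -i
adjust; 0 (or -1 for the mount count) means that check is disabled.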
-VAL
* Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
2008-01-16 0:15 ` Alan Cox
2008-01-16 1:24 ` Daniel Phillips
@ 2008-01-16 21:28 ` Eric Sandeen
1 sibling, 0 replies; 39+ messages in thread
From: Eric Sandeen @ 2008-01-16 21:28 UTC (permalink / raw)
To: Alan Cox
Cc: Daniel Phillips, Pavel Machek, David Chinner, Theodore Tso,
Al Boldi, Valerie Henson, Rik van Riel, linux-fsdevel,
linux-kernel
Alan Cox wrote:
>> Writeback cache on disk in itself is not bad; it only gets bad if the
>> disk is not engineered to save all its dirty cache on power loss,
>> using the disk motor as a generator or alternatively a small battery.
>> It would be awfully nice to know which brands fail here, if any,
>> because writeback cache is a big performance booster.
>
> AFAIK no drive saves the cache. The worst case cache flush for drives is
> several seconds with no retries and a couple of minutes if something
> really bad happens.
>
> This is why the kernel has some knowledge of barriers and uses them to
> issue flushes when needed.
Problem is, ext3 has barriers off by default so it's not saving most people.
And then if you turn them on, but have your filesystem on an lvm device,
lvm strips them out again.
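For anyone who wants to turn them on anyway, it's just a mount option --
the device and mount point below are made-up examples, and as noted this
doesn't help once lvm is in the picture:
# mount -o barrier=1 /dev/sda3 /home
or the equivalent line in /etc/fstab:
/dev/sda3   /home   ext3   defaults,barrier=1   1 2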
-Eric
* Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
2008-01-16 3:05 ` Rik van Riel
@ 2008-01-17 7:38 ` Andreas Dilger
0 siblings, 0 replies; 39+ messages in thread
From: Andreas Dilger @ 2008-01-17 7:38 UTC (permalink / raw)
To: Rik van Riel
Cc: Daniel Phillips, Pavel Machek, Alan Cox, Theodore Tso, Al Boldi,
Valerie Henson, linux-fsdevel, linux-kernel
On Jan 15, 2008 22:05 -0500, Rik van Riel wrote:
> With a filesystem that is compartmentalized and checksums metadata,
> I believe that an online fsck is absolutely worth having.
>
> Instead of the filesystem resorting to mounting the whole volume
> read-only on certain errors, part of the filesystem can be offlined
> while an fsck runs. This could even be done automatically in many
> situations.
In ext4 we store per-group state flags in each group, and the group
descriptor is checksummed (to detect spurious flags), so it should
be relatively straightforward to store an "error" flag in a single
group and have it become read-only.
As a starting point, it would be worthwhile to check instances of
ext4_error() to see how many of them can be targeted at a specific
group. I'd guess most of them could be (corrupt inodes, directory
and indirect blocks, incorrect bitmaps).
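(A quick way to get the list to audit, assuming a current kernel tree --
nothing fancy, just:)
$ grep -rn 'ext4_error(' fs/ext4/
Each call site would then need to be classified as "can be confined to a
single group" or "must still take the whole filesystem read-only".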
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
* Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
2008-01-16 1:44 ` Daniel Phillips
2008-01-16 3:05 ` Rik van Riel
2008-01-16 11:49 ` Pavel Machek
@ 2008-01-17 12:29 ` Szabolcs Szakacsits
2008-01-17 22:51 ` Daniel Phillips
2 siblings, 1 reply; 39+ messages in thread
From: Szabolcs Szakacsits @ 2008-01-17 12:29 UTC (permalink / raw)
To: Daniel Phillips
Cc: Pavel Machek, Alan Cox, Theodore Tso, Al Boldi, Valerie Henson,
Rik van Riel, linux-fsdevel, linux-kernel
On Tue, 15 Jan 2008, Daniel Phillips wrote:
> Along with this effort, could you let me know if the world actually
> cares about online fsck? Now we know how to do it, I think, but is it
> worth the effort?
Most users seem to care deeply about "things just work". Here is why
ntfs-3g also took the online fsck path some time ago.
NTFS support had a very bad reputation on Linux, so the new code was
written with rigid sanity checks and extensive automated regression
testing. One of the consequences is that we're detecting way too many
inconsistencies left behind by the Windows NTFS driver and other NTFS
drivers, hardware faults, and device drivers.
To make better use of the non-existent developer resources, the obvious
answer in such cases was to suggest the already existing Windows fsck
(chkdsk). Simple and safe, as most people like us who have never used
Windows would think. However, years of experience show that, depending
on several factors, chkdsk may or may not start, may or may not report
the real problems, may report bogus issues, may run for a long time or
just forever, and may even remove completely valid files. So one could
perhaps consider a suggestion to run chkdsk a call to play Russian
roulette.
Thankfully NTFS has some level of metadata redundancy, with signatures
and weak "checksums", which makes it possible to correct some common and
obvious corruptions on the fly.
Similarly to ZFS, Windows Server 2008 also has self-healing NTFS:
http://technet2.microsoft.com/windowsserver2008/en/library/6f883d0d-3668-4e15-b7ad-4df0f6e6805d1033.mspx?mfr=true
Szaka
--
NTFS-3G: http://ntfs-3g.org
* Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
2008-01-16 1:36 ` Chris Mason
@ 2008-01-17 20:54 ` Pavel Machek
0 siblings, 0 replies; 39+ messages in thread
From: Pavel Machek @ 2008-01-17 20:54 UTC (permalink / raw)
To: Chris Mason
Cc: Daniel Phillips, Alan Cox, David Chinner, Theodore Tso, Al Boldi,
Valerie Henson, Rik van Riel, linux-fsdevel, linux-kernel
On Tue 2008-01-15 20:36:16, Chris Mason wrote:
> On Tue, 15 Jan 2008 20:24:27 -0500
> "Daniel Phillips" <phillips@google.com> wrote:
>
> > On Jan 15, 2008 7:15 PM, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> > > > Writeback cache on disk in itself is not bad; it only gets bad
> > > > if the disk is not engineered to save all its dirty cache on
> > > > power loss, using the disk motor as a generator or alternatively
> > > > a small battery. It would be awfully nice to know which brands
> > > > fail here, if any, because writeback cache is a big performance
> > > > booster.
> > >
> > > AFAIK no drive saves the cache. The worst case cache flush for
> > > drives is several seconds with no retries and a couple of minutes
> > > if something really bad happens.
> > >
> > > This is why the kernel has some knowledge of barriers and uses them
> > > to issue flushes when needed.
> >
> > Indeed, you are right, which is supported by actual measurements:
> >
> > http://sr5tech.com/write_back_cache_experiments.htm
> >
> > Sorry for implying that anybody has engineered a drive that can do
> > such a nice thing with writeback cache.
> >
> > The "disk motor as a generator" tale may not be purely folklore. When
> > an IDE drive is not in writeback mode, something special needs to be done
> > to ensure the last write to media is not a scribble.
> >
> > A small UPS can make writeback mode actually reliable, provided the
> > system is smart enough to take the drives out of writeback mode when
> > the line power is off.
>
> We've had mount -o barrier=1 for ext3 for a while now; it makes
> writeback caching safe. XFS has this on by default, as does reiserfs.
Maybe ext3 should do barriers by default? Having ext3 in "let's corrupt
data by default" mode... seems like a bad idea.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
2008-01-17 12:29 ` Szabolcs Szakacsits
@ 2008-01-17 22:51 ` Daniel Phillips
0 siblings, 0 replies; 39+ messages in thread
From: Daniel Phillips @ 2008-01-17 22:51 UTC (permalink / raw)
To: Szabolcs Szakacsits
Cc: Pavel Machek, Alan Cox, Theodore Tso, Al Boldi, Valerie Henson,
Rik van Riel, linux-fsdevel, linux-kernel
On Jan 17, 2008 7:29 AM, Szabolcs Szakacsits <szaka@ntfs-3g.org> wrote:
> Similarly to ZFS, Windows Server 2008 also has self-healing NTFS:
I guess that is enough votes to justify going ahead and trying an
implementation of the reverse mapping ideas I posted. But of course
more votes for this would be better. If online incremental fsck is
something people want, then please speak up here and that will very
definitely help make it happen.
On the walk-before-run principle, it would initially just be
filesystem checking, not repair. But even this would help, by setting
per-group checked flags that offline fsck could use to do a much
quicker repair pass. And it will let you know when a volume needs to
be taken offline without having to build in planned downtime just in
case, which already eats a bunch of nines.
Regards,
Daniel
* Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
2008-01-16 12:20 ` Valdis.Kletnieks
@ 2008-01-19 14:51 ` Pavel Machek
0 siblings, 0 replies; 39+ messages in thread
From: Pavel Machek @ 2008-01-19 14:51 UTC (permalink / raw)
To: Valdis.Kletnieks
Cc: Daniel Phillips, David Chinner, Alan Cox, Theodore Tso, Al Boldi,
Valerie Henson, Rik van Riel, linux-fsdevel, linux-kernel
Hi!
> > I guess I should try to measure it. (Linux already does writeback
> > caching, with 2GB of memory. I wonder how important the disk's 2MB of
> > cache can be).
>
> It serves essentially the same purpose as the 'async' option in /etc/exports
> (i.e. we declare it "done" when the other end of the wire says it's caught
> the data, not when it's actually committed), with similar latency wins. Of
> course, it's impedance-matching for bursty traffic - the 2M doesn't do much
> at all if you're streaming data to it. For what it's worth, the 80G Seagate
> drive in my laptop claims it has 8M, so it probably does 4 times as much
> good as 2M. ;)
I doubt "impedance-matching" is useful here. SATA link is fast/low
latency, and kernel already does buffering with main memory...
Hmm... what is the way to measure that? Untar a kernel tree a few
times with the write cache on / off?
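Something like this, perhaps (a rough sketch -- the device name and the
tarball are only examples; hdparm -W toggles the drive's own write cache
on ATA disks):
# hdparm -W0 /dev/sda
# time sh -c 'rm -rf linux-2.6.23; tar xjf linux-2.6.23.tar.bz2; sync'
# hdparm -W1 /dev/sda
# time sh -c 'rm -rf linux-2.6.23; tar xjf linux-2.6.23.tar.bz2; sync'
Each case would need a few runs to mean anything.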
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
end of thread
Thread overview: 39+ messages
2008-01-08 21:22 [RFD] Incremental fsck Al Boldi
2008-01-08 21:31 ` Alan
2008-01-09 9:16 ` Andreas Dilger
2008-01-12 23:55 ` Daniel Phillips
2008-01-08 21:41 ` Rik van Riel
2008-01-09 4:40 ` Al Boldi
2008-01-09 7:45 ` Valerie Henson
2008-01-09 11:52 ` Al Boldi
2008-01-09 14:44 ` Rik van Riel
2008-01-10 13:26 ` Al Boldi
2008-01-12 14:51 ` Theodore Tso
2008-01-13 11:05 ` Al Boldi
2008-01-13 17:19 ` Pavel Machek
2008-01-13 17:41 ` Alan Cox
2008-01-15 20:16 ` [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck) Pavel Machek
2008-01-15 21:43 ` David Chinner
2008-01-15 23:07 ` Pavel Machek
2008-01-15 23:44 ` Daniel Phillips
2008-01-16 0:15 ` Alan Cox
2008-01-16 1:24 ` Daniel Phillips
2008-01-16 1:36 ` Chris Mason
2008-01-17 20:54 ` Pavel Machek
2008-01-16 21:28 ` Eric Sandeen
2008-01-16 11:51 ` Pavel Machek
2008-01-16 12:20 ` Valdis.Kletnieks
2008-01-19 14:51 ` Pavel Machek
2008-01-16 16:38 ` Christoph Hellwig
2008-01-16 1:44 ` Daniel Phillips
2008-01-16 3:05 ` Rik van Riel
2008-01-17 7:38 ` Andreas Dilger
2008-01-16 11:49 ` Pavel Machek
2008-01-16 20:52 ` Valerie Henson
2008-01-17 12:29 ` Szabolcs Szakacsits
2008-01-17 22:51 ` Daniel Phillips
2008-01-15 1:04 ` [RFD] Incremental fsck Ric Wheeler
2008-01-14 0:22 ` Daniel Phillips
2008-01-09 8:04 ` Valdis.Kletnieks
[not found] <9JubJ-5mo-57@gated-at.bofh.it>
[not found] ` <9JB3e-85S-13@gated-at.bofh.it>
[not found] ` <9JDRm-4bR-1@gated-at.bofh.it>
[not found] ` <9JHLl-2dL-1@gated-at.bofh.it>
2008-01-11 14:20 ` Bodo Eggert
2008-01-12 10:20 ` Al Boldi