public inbox for linux-kernel@vger.kernel.org
* ReiserFS data corruption in very simple configuration
@ 2001-09-22 10:00 foner-reiserfs
  2001-09-22 12:47 ` Nikita Danilov
  2001-10-01 15:27 ` Hans Reiser
  0 siblings, 2 replies; 24+ messages in thread
From: foner-reiserfs @ 2001-09-22 10:00 UTC
  To: linux-kernel; +Cc: foner-reiserfs

[Please CC me on any replies; I'm not on linux-kernel.]

The ReiserFS that comes with both Mandrake 7.2 and 8.0 has
demonstrated a serious data corruption problem, and I'd like
to know (a) if anyone else has seen this, (b) how to avoid it,
and (c) how to determine how badly I've been bitten.

My configuration in each case has been an AMD CPU running ReiserFS
exactly as configured "out of the box" by running the Mandrake 7.2 or
8.0 installation CD and opting to run ReiserFS instead of the default.
This is a uniprocessor machine with one IDE 80GB Maxtor disk---no RAID
or anything fancy like that.  The hardware itself is rock solid and
has never demonstrated any faults at all.  (MDK 8.0 appears to use
RSFS 3.6.25; I'm no longer running MDK 7.2, so I can't check that.)
The machine had barely been used before each corruption problem; I'm
not running some strange root-priv stuff, and each time, the FS hadn't
had more than a few minutes to a few hours of use since being created.

In each case, I've gotten in trouble by editing my XF86Config-4 file,
guessing wrong on a modeline, hanging X (blank gray screen & no
response to anything), and being forced to hit the reset button
because nothing else worked.  Under 7.2, I discovered that my
XF86Config-4 file suddenly had a block of nulls in it.  That time, I
thought I must have been hallucinating, but I ran a background job to
sync the filesystem every second while continuing to debug the X
problems, and didn't see the corruption again.
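
For reference, a once-a-second sync job of the kind described above
might look roughly like the sketch below.  The function name and its
parameters are made up for illustration; the only real call is
Python's os.sync(), which asks the kernel to flush dirty buffers.
It narrows the window in which unwritten data can be lost, but it is
a workaround, not a guarantee.

```python
import os
import time


def sync_periodically(interval_s=1.0, iterations=None):
    """Call os.sync() every interval_s seconds and return how many syncs ran.

    Hypothetical helper: iterations=None loops forever, as a background
    job would; a finite count is handy for trying it out.
    """
    done = 0
    while iterations is None or done < iterations:
        os.sync()  # flush dirty buffers to disk (Unix only)
        done += 1
        time.sleep(interval_s)
    return done
```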

Now, I was just bitten by the -same- behavior under MDK 8.0.  After
accidentally hanging X, I waited a few seconds just in case a sync was
pending, hit reset, and had all sorts of lossage:
  (1) Parts of the XF86Config-4 file had lines garbled, e.g.,
      sections of the file had apparently been rearranged.
  (2) /var/log/XFree86.0.log was truncated, and maybe garbled.
  (3) Logging in as root was fine, but then logging in as myself
      I got "Last login: <4-5 lines of my XFree86.0.log file (!)>"
      instead of a date!  Logging in again gave me the proper
      last-login time, but clearly wtmp or something else had
      gotten stepped on in some weird way.
Obviously, the behavior I saw once under MDK 7.2 was no hallucination
or accidental yank in Emacs.

I thought the whole point of a journalling file system was to
-prevent- corruption due to an unexpected failure!  This seems to be
-far- worse than a normal filesystem---ext2fs would at least choke and
force fsck to be run, which might actually fix the problem, but this
is ridiculous---it just silently trashes random files.

So I now have possibly-undetected filesystem damage.  My -guess- is
that only files written within a few minutes of the reset are likely
to be affected, but I really don't know, and don't know of a good way
to find out.  Must I reinstall the OS -again-, starting from a blank
partition, to be sure?  Maybe I should just give up on ReiserFS completely.

[If there is a more-appropriate place for me to send this---such as
a particular Mandrake list, or a particular ReiserFS list---please let
me know, particularly if I can get a quick answer -without- going
through the overhead of subscribing to the list, being flooded, and
unsubscribing---that's what archives are for.  Some websearching
for "ReiserFS corruption" yielded -thousands- of hits---not a good
sign---and a very large proportion of them were on this list, so I
figure this is as good a place to ask as any.  Thanks again.]


* Re: ReiserFS data corruption in very simple configuration
  2001-09-22 10:00 ReiserFS data corruption in very simple configuration foner-reiserfs
@ 2001-09-22 12:47 ` Nikita Danilov
  2001-09-22 20:44   ` foner-reiserfs
                     ` (2 more replies)
  2001-10-01 15:27 ` Hans Reiser
  1 sibling, 3 replies; 24+ messages in thread
From: Nikita Danilov @ 2001-09-22 12:47 UTC
  To: foner-reiserfs; +Cc: linux-kernel, Reiserfs mail-list

foner-reiserfs@media.mit.edu writes:
 > [Please CC me on any replies; I'm not on linux-kernel.]
 > 
 > The ReiserFS that comes with both Mandrake 7.2 and 8.0 has
 > demonstrated a serious data corruption problem, and I'd like
 > to know (a) if anyone else has seen this, (b) how to avoid it,
 > and (c) how to determine how badly I've been bitten.
 > 
 > My configuration in each case has been an AMD CPU running ReiserFS
 > exactly as configured "out of the box" by running the Mandrake 7.2 or
 > 8.0 installation CD and opting to run ReiserFS instead of the default.
 > This is a uniprocessor machine with one IDE 80GB Maxtor disk---no RAID
 > or anything fancy like that.  The hardware itself is rock solid and
 > has never demonstrated any faults at all.  (MDK 8.0 appears to use
 > RSFS 3.6.25; I'm no longer running MDK 7.2, so I can't check that.)
 > The machine had barely been used before each corruption problem; I'm
 > not running some strange root-priv stuff, and each time, the FS hadn't
 > had more than a few minutes to a few hours of use since being created.
 > 
 > In each case, I've gotten in trouble by editing my XF86Config-4 file,
 > guessing wrong on a modeline, hanging X (blank gray screen & no
 > response to anything), and being forced to hit the reset button
 > because nothing else worked.  Under 7.2, I discovered that my
 > XF86Config-4 file suddenly had a block of nulls in it.  That time, I
 > thought I must have been hallucinating, but I ran a background job to
 > sync the filesystem every second while continuing to debug the X
 > problems, and didn't see the corruption again.
 > 
 > Now, I was just bitten by the -same- behavior under MDK 8.0.  After
 > accidentally hanging X, I waited a few seconds just in case a sync was
 > pending, hit reset, and had all sorts of lossage:
 >   (1) Parts of the XF86Config-4 file had lines garbled, e.g.,
 >       sections of the file had apparently been rearranged.
 >   (2) /var/log/XFree86.0.log was truncated, and maybe garbled.
 >   (3) Logging in as root was fine, but then logging in as myself
 >       I got "Last login: <4-5 lines of my XFree86.0.log file (!)>"
 >       instead of a date!  Logging in again gave me the proper
 >       last-login time, but clearly wtmp or something else had
 >       gotten stepped on in some weird way.
 > Obviously, the behavior I saw once under MDK 7.2 was no hallucination
 > or accidental yank in Emacs.
 > 
 > I thought the whole point of a journalling file system was to
 > -prevent- corruption due to an unexpected failure!  This seems to be
 > -far- worse than a normal filesystem---ext2fs would at least choke and
 > force fsck to be run, which might actually fix the problem, but this
 > is ridiculous---it just silently trashes random files.

Stock reiserfs only provides meta-data journalling. It guarantees that
the structure of your file-system will be correct after journal replay,
not the content of files. It will never "trash" a file that wasn't being
accessed at the moment of the crash, though. Full data-journaling comes
at a cost. There is a patch by Chris Mason <Mason@Suse.COM> to support
data journaling in reiserfs. Ext3 supports it also.

 > 
 > So I now have possibly-undetected filesystem damage.  My -guess- is
 > that only files written within a few minutes of the reset are likely
 > to be affected, but I really don't know, and don't know of a good way
 > to find out.  Must I reinstall the OS -again-, starting from a blank
 > partition, to be sure?  Maybe I should just give up on ReiserFS completely.
 > 
 > [If there is a more-appropriate place for me to send this---such as
 > a particular Mandrake list, or a particular ReiserFS list---please let
 > me know, particularly if I can get a quick answer -without- going

Reiserfs mail-list <Reiserfs-List@Namesys.COM>,
archive at http://marc.theaimsgroup.com/?l=reiserfs&r=1&w=2

 > through the overhead of subscribing to the list, being flooded, and
 > unsubscribing---that's what archives are for.  Some websearching
 > for "ReiserFS corruption" yielded -thousands- of hits---not a good
 > sign---and a very large proportion of them were on this list, so I
 > figure this is as good a place to ask as any.  Thanks again.]

Nikita.


* ReiserFS data corruption in very simple configuration
  2001-09-22 12:47 ` Nikita Danilov
@ 2001-09-22 20:44   ` foner-reiserfs
  2001-09-25 13:28     ` Stephen C. Tweedie
  2001-09-24  9:25   ` [reiserfs-list] " Jens Benecke
  2001-09-25 20:13   ` Mike Fedyk
  2 siblings, 1 reply; 24+ messages in thread
From: foner-reiserfs @ 2001-09-22 20:44 UTC
  To: Nikita; +Cc: linux-kernel, Reiserfs-List, foner-reiserfs

    Date: Sat, 22 Sep 2001 16:47:31 +0400
    From: Nikita Danilov <Nikita@Namesys.COM>

    Stock reiserfs only provides meta-data journalling. It guarantees that
    the structure of your file-system will be correct after journal replay,
    not the content of files. It will never "trash" a file that wasn't being
    accessed at the moment of the crash, though.

Thanks for clarifying this.  However, I should point out that the
failure mode is quite serious---whereas ext2fs would simply fail
to record data written to a file before a sync, reiserfs seems to
have instead -swapped random pieces of one file with another-,
which is -much- harder to detect and fix.  I can live with uncommitted
changes, but it's hard to justify the behavior I saw---it means that
any even slightly-busy machine that experiences a crash could have
dozens or hundreds of files with each others' contents all mixed
together---remember, parts of my XF86Config file wound up in wtmp!
And both XF86Config and wtmp had been written at least 20 seconds
before I had to push the reset button, and perhaps > 30 seconds; I
don't recall how often the FS is syncing by default, but it's
disturbing behavior.  After all, at the time I pushed reset, I had
-no- files actually being written by any process under my direct
control; I'd merely written one file out from Emacs under a minute
earlier.  I'd hate to think of what would happen if this sort of thing
hit a CVS repository.

This seems to outweigh the convenience of a rapid start after failure
(one of the reasons I decided to try reiserfs in the first place),
because a failure means possibly having to recover an entire file
server from backups (hence losing -lots more- data) because you don't
know which files might have been trashed if the machine loses power or
the kernel panics.  There's no simple test ("did my edits make it into
the file?"), and no way to really know if the machine might later
misbehave because critical files have been overwritten with parts of
others.  (This inability to easily figure out what might have been
affected also means that the damage will rapidly propagate to backups,
hence making the backups useless.)  About the only way around it would
seem to be to checksum every file in the FS at regular intervals, and
rechecksum after a crash---at which point, what's the point of not
having to run fsck?
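
To make the checksum-everything workaround concrete, here is a minimal
sketch of it.  The function names checksum_tree and changed_files are
invented for this example, and SHA-256 is just one reasonable choice of
digest.  Note that it cannot tell crash damage from legitimate edits
made since the last snapshot, and reading every file is at least as
expensive as the fsck one was hoping to avoid.

```python
import hashlib
import os


def checksum_tree(root):
    """Map each regular file under root (relative path) to its SHA-256 hex digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks, sockets, device nodes and the like.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            digest = hashlib.sha256()
            with open(path, "rb") as f:
                # Read in 1 MiB chunks so large files do not fill memory.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            manifest[os.path.relpath(path, root)] = digest.hexdigest()
    return manifest


def changed_files(before, after):
    """Return sorted paths whose digest differs or that exist in only one manifest."""
    return sorted(p for p in before.keys() | after.keys()
                  if before.get(p) != after.get(p))
```

One would save the result of checksum_tree() at regular intervals and,
after a crash, compare a fresh run against the last saved manifest with
changed_files().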

Is this -really- how reiserfs is supposed to behave in a crash?
It seems like this should be prominently documented in the description
of the file system---I know that I'm rather nervous about using it if
this is true, since it turns a few minutes of fsck'ing (for ext2fs)
into a restore-the-whole-file-system exercise instead.  Surely that's
not right.  If this is really supposed to be how reiserfs behaves any
time it doesn't get to sync before a machine dies on it, I can't see
how it can be justified for any production use, and I'll probably have
to reinstall my OS using ext2fs instead.

                                 Full data-journaling comes at a cost. There
    is a patch by Chris Mason <Mason@Suse.COM> to support data journaling in
    reiserfs. Ext3 supports it also.

Do you have a URL for this?  A search for reiserfs and mason yields
12,000 hits.  (I'm particularly interested in one for reiserfs 3.6.25
and Mandrake 8.0, but I assume there may be several variants in the
same repository.)

     > So I now have possibly-undetected filesystem damage.  My -guess- is
     > that only files written within a few minutes of the reset are likely
     > to be affected, but I really don't know, and don't know of a good way
     > to find out.  Must I reinstall the OS -again-, starting from a blank
     > partition, to be sure?  Maybe I should just give up on ReiserFS completely.
     > 
     > [If there is a more-appropriate place for me to send this---such as
     > a particular Mandrake list, or a particular ReiserFS list---please let
     > me know, particularly if I can get a quick answer -without- going

    Reiserfs mail-list <Reiserfs-List@Namesys.COM>,
    archive at http://marc.theaimsgroup.com/?l=reiserfs&r=1&w=2

Thanks.  I saw that list before, and I'm glad that you've included it
in this discussion.


* Re: [reiserfs-list] Re: ReiserFS data corruption in very simple configuration
  2001-09-22 12:47 ` Nikita Danilov
  2001-09-22 20:44   ` foner-reiserfs
@ 2001-09-24  9:25   ` Jens Benecke
  2001-10-14 14:52     ` Chris Mason
  2001-09-25 20:13   ` Mike Fedyk
  2 siblings, 1 reply; 24+ messages in thread
From: Jens Benecke @ 2001-09-24  9:25 UTC
  To: linux-kernel, Reiserfs mail-list

[-- Attachment #1: Type: text/plain, Size: 1474 bytes --]

On Sat, Sep 22, 2001 at 04:47:31PM +0400, Nikita Danilov wrote:
> foner-reiserfs@media.mit.edu writes:
>  > [Please CC me on any replies; I'm not on linux-kernel.]
>  > 
>  > The ReiserFS that comes with both Mandrake 7.2 and 8.0 has
>  > demonstrated a serious data corruption problem, and I'd like to know
>  > (a) if anyone else has seen this, (b) how to avoid it, and (c) how to
>  > determine how badly I've been bitten.
>  > 
> Stock reiserfs only provides meta-data journalling. It guarantees that
> the structure of your file-system will be correct after journal replay,
> not the content of files. It will never "trash" a file that wasn't being
> accessed at the moment of the crash, though. Full data-journaling comes
> at a cost. There is a patch by Chris Mason <Mason@Suse.COM> to support
> data journaling in reiserfs. Ext3 supports it also.

one question:

When I was using ext2 I always mounted the /usr partition read-only, so
that an fsck wasn't necessary at boot - and the files were all guaranteed
to be OK to bring the system up at least.

Does this (mount -o ro) make sense with ReiserFS as well? What I mean is,
is there a chance of a file getting corrupted that was only *read* (not
*written*) at or before a power outage?
 

I mount all my system partitions with -o notail,noatime if that makes any
difference.


-- 
Jens Benecke ········ http://www.hitchhikers.de/ - Europas Mitfahrzentrale

                           rm -rf /bin/laden

[-- Attachment #2: Type: application/pgp-signature, Size: 240 bytes --]


* Re: ReiserFS data corruption in very simple configuration
  2001-09-22 20:44   ` foner-reiserfs
@ 2001-09-25 13:28     ` Stephen C. Tweedie
  2001-09-29  4:44       ` Lenny Foner
  0 siblings, 1 reply; 24+ messages in thread
From: Stephen C. Tweedie @ 2001-09-25 13:28 UTC
  To: foner-reiserfs; +Cc: Nikita, Stephen Tweedie, linux-kernel

Hi,

On Sat, Sep 22, 2001 at 04:44:21PM -0400, foner-reiserfs@media.mit.edu wrote:

>     Stock reiserfs only provides meta-data journalling. It guarantees that
>     the structure of your file-system will be correct after journal replay,
>     not the content of files. It will never "trash" a file that wasn't being
>     accessed at the moment of the crash, though.
> 
> Thanks for clarifying this.  However, I should point out that the
> failure mode is quite serious---whereas ext2fs would simply fail
> to record data written to a file before a sync, reiserfs seems to
> have instead -swapped random pieces of one file with another-,
> which is -much- harder to detect and fix.

Not true.  ext2, ext3 in its "data=writeback" mode, and reiserfs can
all demonstrate this behaviour.  Reiserfs is being no worse than ext2
(the timings may make the race more or less likely in reiserfs, but
ext2 _is_ vulnerable.)

e2fsck only restores metadata consistency on ext2 after a crash: it
can't possibly guarantee that all the data blocks have been written.

ext3 will let you do full data journaling, but also has a third mode
(the default), which doesn't journal data, but which does make sure
that data is flushed to disk before the transaction which allocated
that data is allowed to commit.  That gives you most of the
performance of ext3's fast-and-loose writeback mode, but with an
absolute guarantee that you never see stale blocks in a file after a
crash.
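
The difference between the writeback and ordered modes can be shown
with a toy model.  This is only a sketch under stated assumptions: the
function replay_after_crash, the single block number 7, and the
two-step write are invented for illustration and are not real ext3 or
reiserfs code.  Writeback is modelled in its worst case, with the
metadata commit reaching disk before the data; in practice the order is
simply unconstrained.

```python
STALE = "stale data from a deleted file"
FRESH = "new data for A"


def replay_after_crash(mode, crash_after_step):
    """Return the blocks file A shows after a crash, in a toy one-block disk.

    mode: "writeback" (worst case: metadata commits before data is flushed)
          or "ordered" (data is flushed before the allocating commit).
    crash_after_step: how many of the two steps completed (0, 1 or 2).
    """
    disk = {7: STALE}     # block 7 was freed earlier; its old content remains
    meta = {"A": []}      # committed metadata: file name -> list of blocks

    def flush_data():
        disk[7] = FRESH   # the new file data reaches the disk

    def commit_meta():
        meta["A"] = [7]   # the transaction allocating block 7 to A commits

    steps = {
        "writeback": [commit_meta, flush_data],
        "ordered": [flush_data, commit_meta],
    }[mode]
    for step in steps[:crash_after_step]:
        step()
    return [disk[block] for block in meta["A"]]
```

In this model only the writeback ordering can leave file A pointing at
a block that still holds somebody else's old data; the ordered sequence
shows either nothing or the new data.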

Cheers,
 Stephen


* Re: ReiserFS data corruption in very simple configuration
  2001-09-22 12:47 ` Nikita Danilov
  2001-09-22 20:44   ` foner-reiserfs
  2001-09-24  9:25   ` [reiserfs-list] " Jens Benecke
@ 2001-09-25 20:13   ` Mike Fedyk
  2001-09-26 14:43     ` Stephen C. Tweedie
  2 siblings, 1 reply; 24+ messages in thread
From: Mike Fedyk @ 2001-09-25 20:13 UTC
  To: linux-kernel

On Sat, Sep 22, 2001 at 04:47:31PM +0400, Nikita Danilov wrote:
> foner-reiserfs@media.mit.edu writes:
>  > [Please CC me on any replies; I'm not on linux-kernel.]
>  > I thought the whole point of a journalling file system was to
>  > -prevent- corruption due to an unexpected failure!  This seems to be
>  > -far- worse than a normal filesystem---ext2fs would at least choke and
>  > force fsck to be run, which might actually fix the problem, but this
>  > is ridiculous---it just silently trashes random files.
> 
> Stock reiserfs only provides meta-data journalling. It guarantees that
> the structure of your file-system will be correct after journal replay,
> not the content of files. It will never "trash" a file that wasn't being
> accessed at the moment of the crash, though. Full data-journaling comes
> at a cost. There is a patch by Chris Mason <Mason@Suse.COM> to support
> data journaling in reiserfs. Ext3 supports it also.
> 

When files on a ReiserFS mount have data from other files, does that mean
that it has recovered wrong meta-data, or is it because the meta-data was
committed before the data?

So, if I write a file, does ReiserFS write the structures first, and if the
data isn't written, whatever else was deleted from the block before will now
be in the file?

If that's so, then one way to keep old deleted data from getting into
partially written files after a crash would be to zero out the blocks on
unlink.  I can imagine that this would prevent undelete, and slow down
deleting considerably.

Another way may be to keep a journal of which blocks have actually been
committed.  Maybe a bitmap in the journal, or some other structure...

If you have data journaling, does that mean there is a possibility of
recovering a complete file -before- it was written?  i.e:

echo a > test;
sync;
cat picture.tif > test
(writing in progress, only partially in journal)
power off

Will "a" be in test upon recovery?


* Re: ReiserFS data corruption in very simple configuration
  2001-09-25 20:13   ` Mike Fedyk
@ 2001-09-26 14:43     ` Stephen C. Tweedie
  2001-10-01  3:38       ` Mike Fedyk
  0 siblings, 1 reply; 24+ messages in thread
From: Stephen C. Tweedie @ 2001-09-26 14:43 UTC
  To: linux-kernel; +Cc: Mike Fedyk, Stephen Tweedie

Hi,

On Tue, Sep 25, 2001 at 01:13:04PM -0700, Mike Fedyk wrote:

> > Stock reiserfs only provides meta-data journalling. It guarantees that
> > the structure of your file-system will be correct after journal replay,
> > not the content of files. It will never "trash" a file that wasn't being
> > accessed at the moment of the crash, though. Full data-journaling comes
> > at a cost. There is a patch by Chris Mason <Mason@Suse.COM> to support
> > data journaling in reiserfs. Ext3 supports it also.
 
> When files on a ReiserFS mount have data from other files, does that mean
> that it has recovered wrong meta-data, or is it because the meta-data was
> committed before the data?

It can be either, but the former can only be the result of a problem
(either hardware fault or a data-corrupting software bug of some
description).  In the normal case, only the latter scenario happens.

ext3 has a mode to flush all data before metadata gets committed.
That is its default mode, and it avoids this problem without having to
actually journal the data.

> So, if I write a file, does ReiserFS write the structures first, and if the
> data isn't written, whatever else was deleted from the block before will now
> be in the file?

Yep.  ext3 behaves in the same way in its fastest "writeback" data
mode.

> If that's so, then one way to keep old deleted data from getting into
> partially written files after a crash would be to zero out the blocks on
> unlink.  I can imagine that this would prevent undelete, and slow down
> deleting considerably.

Indeed.

> Another way may be to keep a journal of which blocks have actually been
> committed.  Maybe a bitmap in the journal, or some other structure...

ext3 does exactly that.  It's necessary to keep things in sync if we
have blocks of data being deleted and reallocated as metadata, or
vice-versa.

> If you have data journaling, does that mean there is a possibility of
> recovering a complete file -before- it was written?  i.e:

> echo a > test;
> sync;
> cat picture.tif > test
> (writing in progress, only partially in journal)
> power off
 
> Will "a" be in test upon recovery?

If you are using full data journaling (ext3's "journal" data mode) or
the default "ordered" data mode, then no, you never see such
behaviour.

In the ordered mode, it achieves this precisely because it is keeping
a record of which blocks have been committed (or, more accurately,
which *deleted* blocks have had the delete committed).  If you do a
"cat > file", then before the new data is written, the file gets
truncated and all its old data blocks deleted.  ext3 will then refuse
to reuse those blocks until the delete has been committed, so if we
crash and end up rolling back the delete transaction, we'll never see
new data blocks in the old file.
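
The rule that freed blocks are not reused until the delete commits can
be sketched as a toy allocator.  The class name DeferredReuseAllocator
and its methods are hypothetical and only illustrate the bookkeeping;
they do not mirror ext3's actual data structures.

```python
class DeferredReuseAllocator:
    """Toy allocator that never hands out blocks freed by an uncommitted delete."""

    def __init__(self, free_blocks):
        self.free = set(free_blocks)  # safe to allocate right now
        self.pending = set()          # freed, but the delete has not committed

    def release(self, block):
        """Record that a file gave up this block in the running transaction."""
        self.pending.add(block)

    def allocate(self):
        """Return the lowest-numbered reusable block, ignoring pending frees."""
        if not self.free:
            raise RuntimeError("no reusable blocks until the delete commits")
        block = min(self.free)
        self.free.remove(block)
        return block

    def commit(self):
        """The delete transaction committed: pending blocks become reusable."""
        self.free |= self.pending
        self.pending.clear()
```

If the machine crashes before commit() and the delete is rolled back,
the old file still owns its blocks and nothing new has been written
into them, because allocate() never returned them.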

Cheers,
 Stephen


* ReiserFS data corruption in very simple configuration
  2001-09-25 13:28     ` Stephen C. Tweedie
@ 2001-09-29  4:44       ` Lenny Foner
  2001-09-29 12:52         ` [reiserfs-list] " Lehmann 
  2001-10-01 11:30         ` Stephen C. Tweedie
  0 siblings, 2 replies; 24+ messages in thread
From: Lenny Foner @ 2001-09-29  4:44 UTC
  To: sct; +Cc: Nikita, Mason, linux-kernel, reiserfs-list, foner-reiserfs

[As before, please make sure you CC me on replies or I won't see them.  Tnx!]

    Date: Tue, 25 Sep 2001 14:28:54 +0100
    From: "Stephen C. Tweedie" <sct@redhat.com>

    Hi,

    On Sat, Sep 22, 2001 at 04:44:21PM -0400, foner-reiserfs@media.mit.edu wrote:

    >     Stock reiserfs only provides meta-data journalling. It guarantees that
    >     the structure of your file-system will be correct after journal replay,
    >     not the content of files. It will never "trash" a file that wasn't being
    >     accessed at the moment of the crash, though.
    > 
    > Thanks for clarifying this.  However, I should point out that the
    > failure mode is quite serious---whereas ext2fs would simply fail
    > to record data written to a file before a sync, reiserfs seems to
    > have instead -swapped random pieces of one file with another-,
    > which is -much- harder to detect and fix.

    Not true.  ext2, ext3 in its "data=writeback" mode, and reiserfs can
    all demonstrate this behaviour.  Reiserfs is being no worse than ext2
    (the timings may make the race more or less likely in reiserfs, but
    ext2 _is_ vulnerable.)

ext2fs can write parts of file A to file B, and vice versa, and this
isn't fixed by fsck?  [See outcome (d) below.]  I'm having difficulty
believing how this can be possible for a non-journalling filesystem.

    e2fsck only restores metadata consistency on ext2 after a crash: it
    can't possibly guarantee that all the data blocks have been written.

But what about written to the wrong files?  See below.

    ext3 will let you do full data journaling, but also has a third mode
    (the default), which doesn't journal data, but which does make sure
    that data is flushed to disk before the transaction which allocated
    that data is allowed to commit.  That gives you most of the
    performance of ext3's fast-and-loose writeback mode, but with an
    absolute guarantee that you never see stale blocks in a file after a
    crash.

I've been getting a stream of private mail over the last few days
saying one thing or another about various filesystems with various
optional patches, so let me get this out in the open and see if we can
converge on an answer here.  [ext2fs, ext3fs, and reiserfs answers
should feel free to cite which mode they're talking about and URLs for
whatever patches are required to get to that mode; some impressions
about reliability and maturity would be useful, too.]

Let's take this scenario:  Files A and B have had blocks written to
them sometime in the recent past (30 to 60 seconds or so) and a sync
has not happened yet.  (I don't know how often reiserfs will be synced
by default; 60 seconds?  Longer?  Presumably running "sync" will force
it, but I don't know when else it will happen.)  File A may have been
completely rewritten or newly written (e.g., what Emacs does when it
saves a file), whereas file B may have simply been appended to (e.g.,
what happens when wtmp is updated).

The CPU reset button is then pushed.  [See P.P.S. at end of this message.]

Now, we have the following possibilities for the outcome after the
system comes back up and has finished checking its filesystem:

(a) Metadata correctly written, file data correctly written.
(b) Metadata correctly written, file data partially written.
    (E.g., one or both files might have been partially or completely
    updated.) 
(c) Metadata correctly written, file data completely unwritten.
    (Neither file got updated at all.)
(d) Metadata correctly written, FILE DATA INTERCHANGED BETWEEN A AND B.
    (E.g., File A gets some of file B written somewhere within it,
    and file B gets some of file A written somewhere within it---this
    is the behavior I observed, at least twice, with reiserfs.)
(e) Metadata corrupted in some fashion, file data undefined.
    ("Undefined" means could be any of (a) through (d) above; I don't care.)

Now, which filesystems can show each outcome?  I don't know.  I
contend that reiserfs does (d).  Stephen Tweedie talks above about
whether we can "guarantee that all the data blocks have been written",
but may be missing the point I was making, namely that THE BLOCKS HAVE
BEEN WRITTEN TO THE WRONG FILES.

It would be nice to know, for each of ext2fs, ext3fs, and reiserfs,
what the -intended- outcome is, and what the -actual- outcome is
(since implementation bugs might make the actual outcome different
from the intended outcome).  Any additional filesystems anyone would
like to toss into the pot would be welcome; maybe I'll post a matrix
of the results, if we get some.

I'm -assuming- that the intended outcome for reiserfs (without data
journalling) is one of (a), (b), or (c).  If the intended outcome for
reiserfs without data journalling [or -any- FS, really] is in fact
(d), then I don't understand how this filesystem can be intended for
any reliable service, since a failure will garble all files written in
the last several seconds in a fashion that is very, very difficult to
unscramble.  (-Perhaps-, if all the metadata is indeed correct, it
would be possible to at least -identify- which files may have gotten
smashed, by looking for all files whose mtime or ctime is in the last
60 seconds (more?) before the failure, but they'd still be trashed in
bizarre ways---it's much easier to fix a file (particularly a text
file) that is simply out of date (having had only some, or none, of
its recent data written) than it is to fix one that's had data from
other file(s) added to it in random places.  Furthermore, files such
as wtmp will probably get their mtime modified the instant the system
comes back up, further muddying the waters.)
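
As a rough sketch of the identify-by-timestamp idea: the function name
files_modified_in_window and its parameters are invented for this
example, and the 60-second default is only the guess made above.  It
assumes the crash time is known as a Unix timestamp, and it inherits
the caveat already mentioned: files touched again at boot (wtmp, for
instance) will no longer show a pre-crash mtime.

```python
import os


def files_modified_in_window(root, crash_time, window_s=60.0):
    """Return sorted paths under root whose mtime or ctime falls within
    window_s seconds before crash_time (a Unix timestamp)."""
    start = crash_time - window_s
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue             # file vanished or is unreadable
            if (start <= st.st_mtime <= crash_time
                    or start <= st.st_ctime <= crash_time):
                hits.append(path)
    return sorted(hits)
```

The resulting list is only a set of suspects to inspect by hand; it
says nothing about whether a given file was actually damaged.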

Can someone(s) help to address the above?  And, even better, could
this information be placed prominently on the web pages describing the
relevant file systems?  It would be extremely useful for people trying
to decide which one to run to know this -before- they have committed
umpteen gigabytes to one or the other and -then- get bitten.

Thanks!

P.S.  Nikita Danilov said that there is a data-journalling patch to
reiserfs written by Chris Mason <Mason@Suse.COM>, but has not responded
with a URL to it; can someone (or Chris? now CC'ed) do so?  A search
for reiserfs and mason is useless, yielding 12,000 hits.  (I'm
particularly interested in one for reiserfs 3.6.25 and Mandrake 8.0,
but I assume there may be several variants in the same repository.)
Benchmarking data on the performance impact of data journalling for
reiserfs, ext3fs, and anything else anyone cares to supply would
probably be useful to lots of people as well.

P.P.S.  I say reset and not power-off, although I hope that this is
moot, because I presume that the unsynced data, by virtue of being
unsynced, is nowhere near the disk datapaths anyway.  But either way,
a reset should let the disks continue to write data out of their write
buffers, assuming that a CPU reset doesn't flush such pending
transactions; I don't know if there's some IDE bus sequence that can
do this, and whether CPU reset would issue such a sequence.  It may
not matter; is it common that disks might leave data buffered but
unwritten for 30 seconds if there is no other disk activity?  I would
hope that this is -not- true and that the buffered data is buffered
only while there is other activity, since failing to flush the buffer
when the disk is idle only increases the risk of losing it without
improving performance at all.


* Re: [reiserfs-list] ReiserFS data corruption in very simple configuration
  2001-09-29  4:44       ` Lenny Foner
@ 2001-09-29 12:52         ` Lehmann 
  2001-10-01  1:00           ` foner-reiserfs
  2001-10-01 11:30         ` Stephen C. Tweedie
  1 sibling, 1 reply; 24+ messages in thread
From: Lehmann  @ 2001-09-29 12:52 UTC
  To: Lenny Foner; +Cc: sct, Nikita, Mason, linux-kernel, reiserfs-list

On Sat, Sep 29, 2001 at 12:44:59AM -0400, Lenny Foner <foner-reiserfs@media.mit.edu> wrote:
> isn't fixed by fsck?  [See outcome (d) below.]  I'm having difficulty
> believing how this can be possible for a non-journalling filesystem.

If you have difficulties in believing this, may I ask you how you think it
is possible for a non-journaling filesystem to prevent this at all?

> But what about written to the wrong files?  See below.

What you see is most probably *old* data, not data from another (still
existing) file.

> has not happened yet.  (I don't know how often reiserfs will be synced
> by default; 60 seconds?  Longer?  Presumably running "sync" will force

mostly like with any other filesystem (man bdflush)

> Now, we have the following possibilities for the outcome after the

> (a) Metadata correctly written, file data correctly written.

all filesystems ;)

> (b) Metadata correctly written, file data partially written.
>     (E.g., one or both files might have been partially or completely
>     updated.) 

ext2, reiserfs.

> (c) Metadata correctly written, file data completely unwritten.
>     (Neither file got updated at all.)

ext2, reiserfs.
   
> (d) Metadata correctly written, FILE DATA INTERCHANGED BETWEEN A AND B.

this shouldn't happen on reiserfs. however, the unwritten parts of file a can easily
contain data formerly in file b.

> (e) Metadata corrupted in some fashion, file data undefined.
>     ("Undefined" means could be any of (a) through (d) above; I don't care.)

this should be prevented by journaling (of course, this won't help against
harddisk failures) on reiserfs. ext2 often has this problem, but fsck usually
can repair it. it's easy to tell metadata from filedata on ext2.

> whether we can "guarantee that all the data blocks have been written",
> but may be missing the point I was making, namely that THE BLOCKS HAVE
> BEEN WRITTEN TO THE WRONG FILES.

remember that the blocks have previous content, and reiserfs' tails
optimization means that files appended all the time (wtmp) can move around
rapidly (at least their tail).

> P.P.S.  I say reset and not power-off, although I hope that this is
> moot, because I presume that the unsynced data, by virtue of being
> unsynced, is nowhere near the disk datapaths anyway.

this can make a big difference. many disks (ibm, maxtor) nowadays write
partial blocks on power outage, this gives "Uncorrectable read errors",
which is fatal, because no filesystem so far can work around this. It's
easy to repair (just rewrite the block), but would require filesystem
feedback (hey, reiserfs, this metadata block is *garbage*).

> a reset should let the disks continue to write data out of their write
> buffers, assuming that a CPU reset doesn't flush such pending

they should, yes. OTOH, ide disks are cheap...

> not matter; is it common that disks might leave data buffered but
> unwritten for 30 seconds if there is no other disk activity?  I would

no. and it doesn't make sense. but it's not forbidden or sth.

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

* [reiserfs-list] ReiserFS data corruption in very simple configuration
  2001-09-29 12:52         ` [reiserfs-list] " Lehmann 
@ 2001-10-01  1:00           ` foner-reiserfs
  2001-10-01  1:26             ` Lehmann 
  0 siblings, 1 reply; 24+ messages in thread
From: foner-reiserfs @ 2001-10-01  1:00 UTC (permalink / raw)
  To: pcg; +Cc: sct, Nikita, Mason, linux-kernel, reiserfs-list, foner-reiserfs

    Date: Sat, 29 Sep 2001 14:52:29 +0200
    From: Marc A. Lehmann <pcg@goof.com>

Thanks for your response!  Bear with me, though, because I'm asking
a design question below that relates to this.

    On Sat, Sep 29, 2001 at 12:44:59AM -0400, Lenny Foner <foner-reiserfs@media.mit.edu> wrote:
    > isn't fixed by fsck?  [See outcome (d) below.]  I'm having difficulty
    > believing how this can be possible for a non-journalling filesystem.

    If you have difficulties in believing this, may I ask you how you think it
    is possible for a non-journaling filesystem to prevent this at all?

Naively, one would assume that any non-journalling FS that has written
correct metadata through to the disk would either have written updates
into files, or failed to write them, but would not have written new
(<60 second old) data into different files than the data was destined for.
(I suppose the assumption I'm making here is that, when creating or
extending a file, the metadata is written -last-, e.g., file blocks
are allocated, file data is written, and -then- metadata is written.
That way, a failure anywhere before finality simply seems to vanish,
whereas writing metadata first seems to cause the lossage below.)

    > But what about written to the wrong files?  See below.

    What you see is most probably *old* data, not data from another (still
    existing) file.

I'm...  dubious, but maybe.  As mentioned earlier in this thread,
one of the failures I saw consisted of having several lines of my
XFree86.0.log file appended to wtmp---when I logged in after the
failure, I got "Last login: " followed by several lines from that file
instead of a date.  (Other failures scrambled other files worse.)

Now, it's -possible- that rsfs allocated an extra portion to the end
of wtmp for the last-login data (as a user of the fs, I don't care
whether officially this was a "block", an entry in a journal, etc),
login "wrote" to that region (but it wasn't committed yet 'cause no
sync), my XFree86.0.log file was "created" and "written" (again
uncommitted), I pushed reset, and then when it came back up, the end
of wtmp had data from the -previous- copy of XFree86.0.log that had
been freed (because it was unlinked when the next copy was written)
but which had not actually had the wtmp data written to it yet
(because a sync hadn't happened).  I have no way to verify this, since
one XFree86.0.log looks much like the other.  Conceptually, this would
imply that wtmp was extended into disk freespace, which just happened
to have that logfile in it (instead of zero bytes).  Is this what
you're talking about when you say "*old* data"?  I think so, and that
seems to match your comment below about file tails moving around
rapidly.

But it doesn't explain -why- it works this way in the first place.
Wouldn't it make more sense to commit metadata to disk -after- the
data blocks are written?  After all, if -either one- isn't written,
the file is incomplete.  But if the metadata is written -last-, the
file simply looks like the data was never added.  If the metadata is
written -first-, the file can scoop up random trash from elsewhere in
the filesystem.  I contend that this is -much- worse, because it can
render a previously-good file completely unparseable by tools that
expect that -all- of the file is in a particular syntax.  It's just
an accident, I guess, that login will accept any random trash when
it prints its "last-login" message, rather than falling over with a
coredump because it doesn't look like a date.  [And see * below.]

Unfortunately, this behavior meant that X -did- fall over, because my
XF86Config file was trashed by being scrambled---I'd recently written
out a new version, after all---and the trashed copy no longer made any
sense.  I would have been -much- happier to have had the -unmodified-,
-old- version than a scrambled "new" version!  Without Emacs ~ files,
this would have been much worse.  Consider an app that, "for reliability",
rewrites a file by creating a temp copy, writing it out, then renaming
the temp over the original [this is how Emacs typically saves files].
But if you write the metadata first, you foil this attempt to be safe,
because you might have this sequence at the actual disk:  [magnetic
oxide updated w/rename][start updating magnetic oxide with tempfile
data][power failure or reset]---ooops! original file gone, new file
doesn't have its data yet, so sorry, thanks for playing.

By writing metadata first, it seems that reiserfs violates the
idempotence of many filesystem operations, and does exactly the
opposite of what "journalling" implies to anyone who understands
databases, namely that either the operation completes entirely, or it
is completely undone.  Yes, yes, I know (now!) that it claims to only
journal the metadata, but how does this help when what it's essentially
doing is trashing the -data- in unexpected ways exactly when such
journalling is supposed to help, namely across a machine failure?

This seems like such an elementary design defect that I'm at a loss
to understand why it's there.  There -must- be some excellent reason,
right?  But what?  And if not, can it be fixed?

I'm also still waiting to find out how to make reiserfs actually
journal its data, and what the performance implications of this are.
No one has responded with a URL.

[*] It's also a security hole.  If I want to read a file that I'm not
authorized to read, -but- I can cause a kernel panic (or a blackout!),
then I can craftily wait until up to several seconds after the
"secure" file is being rewritten (presumably via the write-tempfile-
and-relink method), create a big file of my own, and force the
panic---my file may then get some of the secure blocks from the old
copy.  And, unlike filesystems that write metadata last, the "secure"
program can't just zero out the blocks of the file it's about to
unlink, because---since metadata is written first---those zeroes won't
have made it to disk yet even though the blocks have been declared
free and included in my file.  I now know what's in your file.
Whoops.  And this is such an enormous timing hole that I can write a
program that just checks every 5 seconds or so for a new copy of the
secure file, -then- forces the failure---I need not get the timing
very good, as long as it's likely that I'll do so before the next
sync.  It's so bad that, even if I can't force a panic, my program
can just beep and I'll immediately go short out the outlet that
happens to be on the same circuit as the machine I'm attacking.

    [ . . . ]

    > (d) Metadata correctly written, FILE DATA INTERCHANGED BETWEEN A AND B.

    this shouldn't happen on reiserfs. however, the unwritten parts of file a can easily
    contain data formerly in file b.

Then why allow metadata to be written first instead of last?

    > (e) Metadata corrupted in some fashion, file data undefined.
    >     ("Undefined" means could be any of (a) through (d) above; I don't care.)

    this should be prevented by journaling (of course, this won't help against
    harddisk failures) on reiserfs. ext2 often has this problem, but fsck usually
    can repair it. it's easy to tell metadata from filedata on ext2.

    > whether we can "guarantee that all the data blocks have been written",
    > but may be missing the point I was making, namely that THE BLOCKS HAVE
    > BEEN WRITTEN TO THE WRONG FILES.

    remember that the blocks have previous content, and reiserfs' tails
    optimization means that files appended all the time (wtmp) can move around
    rapidly (at least their tail).

    [ . . . ]

* Re: [reiserfs-list] ReiserFS data corruption in very simple configuration
  2001-10-01  1:00           ` foner-reiserfs
@ 2001-10-01  1:26             ` Lehmann 
  2001-10-01  2:32               ` foner-reiserfs
  2001-10-03 16:28               ` Toby Dickenson
  0 siblings, 2 replies; 24+ messages in thread
From: Lehmann  @ 2001-10-01  1:26 UTC (permalink / raw)
  To: foner-reiserfs; +Cc: sct, Nikita, Mason, linux-kernel, reiserfs-list

On Sun, Sep 30, 2001 at 09:00:49PM -0400, foner-reiserfs@media.mit.edu wrote:
> extending a file, the metadata is written -last-, e.g., file blocks
> are allocated, file data is written, and -then- metadata is written.

this is almost impossible to achieve with existing hardware (witness the
many discussions about disk caching for example), and, without journaling,
might even be slow.

> of wtmp had data from the -previous- copy of XFree86.0.log that had
> been freed (because it was unlinked when the next copy was written)
> but which had not actually had the wtmp data written to it yet

It's easily possible, but it could also be a bug. Let the reiserfs authors
decide.

However, if it is indeed "a bug" then fixing it would only lower the
frequency of occurrence.

Only ext3 (some modes) + turning off your harddisk's cache can ensure
this, at the moment.

> to have that logfile in it (instead of zero bytes).  Is this what
> you're talking about when you say "*old* data"?  I think so, and that
> seems to match your comment below about file tails moving around
> rapidly.

appending to logfiles will result in a lot of movement. with other,
strictly block-based filesystems this occurs relatively frequent, and data
will not usually move around. with reiserfs tail movement is frequent.

> Wouldn't it make more sense to commit metadata to disk -after- the
> data blocks are written?

The problem is that there is currently no easy way to achieve that.

> file simply looks like the data was never added.  If the metadata is
> written -first-, the file can scoop up random trash from elsewhere in

Also, this is not a matter of metadata first or last. Sometimes you need
metadata first, sometimes you need it last. And in many cases, "metadata"
does not need to change, while data still changes.

> the filesystem.  I contend that this is -much- worse, because it can
> render a previously-good file completely unparseable by tools that
> expect that -all- of the file is in a particular syntax.

It depends - with ext2 you frequently have garbled files, too. Basically, if
you write to a file and turn off the power the outcome is unexpected, and
will always be (unless you are ready to take the big speed hit).

> Unfortunately, this behavior meant that X -did- fall over, because my
> XF86Config file was trashed by being scrambled---I'd recently written
> out a new version, after all---and the trashed copy no longer made any

But the same thing can and does happen with ext2, depending on your editor
and your timing. It is not a reiserfs thing.

> But if you write the metadata first, you foil this attempt to be safe,
> because you might have this sequence at the actual disk:  [magnetic
> oxide updated w/rename][start updating magnetic oxide with tempfile
> data][power failure or reset]---ooops! original file gone, new file
> doesn't have its data yet, so sorry, thanks for playing.

Of course. If you want data to hit the disk, you have to use fsync. This
does work with reiserfs and will ensure that the data hits the disk. If
you don't do this then bad things might happen.
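
A minimal sketch of that advice, applied to the write-temp-then-rename
save described in the quoted text.  It is in Python, the helper name
atomic_save is hypothetical (it appears nowhere in this thread), and it
assumes a POSIX-like system where a directory can be opened and fsynced:

```python
import os
import tempfile


def atomic_save(path, data):
    """Replace `path` with `data` (bytes), flushing the data before the rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays
    # within a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()              # push Python's buffer to the kernel
            os.fsync(tmp.fileno())   # ask the kernel to push it to the disk
        os.replace(tmp_path, path)   # rename the temp file over the original
    except BaseException:
        # On any failure, remove the temp file and leave the original alone.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Best effort: also flush the directory entry.  Opening a directory
    # for fsync is an assumption that holds on Linux but not everywhere.
    try:
        dir_fd = os.open(directory, os.O_RDONLY)
    except OSError:
        return
    try:
        os.fsync(dir_fd)
    except OSError:
        pass
    finally:
        os.close(dir_fd)
```

The point of the ordering is that the rename is only issued after the new
contents have been fsynced, so a crash should leave either the old file or
the complete new one.  Two caveats: mkstemp creates the temp file with mode
0600, so this sketch does not preserve the original file's permissions, and
fsync can only be as good as the disk's handling of its own write cache.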

> By writing metadata first, it seems that reiserfs violates the
> idempotence of many filesystem operations, and does exactly the
> opposite of what "journalling" implies to anyone who understands
> databases, namely that either the operation completes entirely, or it
> is completely undone.

You are confusing databases with filesystems, however. Most journaling
filesystems work that way. Some (like ext3) are nice enough to let you
choose.

> journal the metadata, but how does this help when what it's essentially
> doing is trashing the -data- in unexpected ways exactly when such
> journalling is supposed to help, namely across a machine failure?

But ext2 works in the same way. It does happen more often with reiserfs
(especially with tails), but ignoring the problem for ext2 doesn't make it
right. If applications don't work reliably with reiserfs, they don't work
reliably with ext2. If you want reliability then mount synchronous.

> This seems like such an elementary design defect that I'm at a loss
> to understand why it's there.

About every filesystem does have this "elementary design defect". If you
want data to hit the disk, sync it. It's that simple.

> There -must- be some excellent reason,
> right?  But what?  And if not, can it be fixed?

Speed is an excellent reason. The fix is to tell the kernel to write the
data out to the platters.

Anyway, this is a good time to review the various discussions on the
reiserfs list and the kernel list on how to teach the kernel (if it is
possible) to implement loose write-ordering.

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

* [reiserfs-list] ReiserFS data corruption in very simple configuration
  2001-10-01  1:26             ` Lehmann 
@ 2001-10-01  2:32               ` foner-reiserfs
  2001-10-03 16:28               ` Toby Dickenson
  1 sibling, 0 replies; 24+ messages in thread
From: foner-reiserfs @ 2001-10-01  2:32 UTC (permalink / raw)
  To: pcg; +Cc: sct, Nikita, Mason, linux-kernel, reiserfs-list, foner-reiserfs

    Date: Mon, 1 Oct 2001 03:26:27 +0200
    From: Marc A. Lehmann <pcg@goof.com>

    On Sun, Sep 30, 2001 at 09:00:49PM -0400, foner-reiserfs@media.mit.edu wrote:
    > extending a file, the metadata is written -last-, e.g., file blocks
    > are allocated, file data is written, and -then- metadata is written.

    this is almost impossible to achieve with existing hardware (witness the
    many discussions about disk caching for example), and, without journaling,
    might even be slow.

I think perhaps we may be talking past each other; let me try to clarify.

As I said earlier in this thread, this has nothing at all to do with
disk caching.  Let me restate this again:  The scenario I'm discussing
is an otherwise-idle machine that had 2 (maybe 3) files modified, sat
idle for 30-60 seconds, and then had the reset button pushed.  I would
expect that either file data and metadata got written, or neither got
written, but not metadata without file data.  This is repeatable more
or less at will---I didn't -just- happen to catch it -just- as it
decided to frob the disks.  Instead, the problem seems to be that
reiserfs is perfectly happy to update the on-disk representation of
which disk blocks contain which files' data, and then -sit there- for
a long time (a minute? longer?) without -also- attempting to flush the
file data to the disk.  This then leads to corrupted files after the
reset.  It's not that the CPU sent data to the disk subsystem that
failed to be written by the time of the interruption; it's that the
data was still sitting in RAM and the CPU hadn't even decided to get
it out the IDE channel yet.  This means that there is -always- a giant
timing hole which can corrupt data, as opposed to just the much-tinier
hole that would be created if the file-bytes-to-disk-bytes correspondence
were updated immediately after the write that wrote the data---it
would be hard for me to accidentally hit such a hole.

    > of wtmp had data from the -previous- copy of XFree86.0.log that had
    > been freed (because it was unlinked when the next copy was written)
    > but which had not actually had the wtmp data written to it yet

    It's easily possible, but it could also be a bug. Let the reiserfs authors
    decide.

    However, if it is indeed "a bug" then fixing it would only lower the
    frequency of occurrence.

True, but as long as it makes it only happen if the disk is -in
progress of writing stuff- when the reset or power failure happens,
the risk is -greatly- reduced.  Right now, it's an enormous timing
hole, and one that's likely to be hit---it's happened to me -every
single time- I've had to hit the reset button because (for example)
I wedged X while debugging, and even if I waited a minute after the
wedge-up to do so!  The way I've avoided it is by running a job that
syncs once a second while doing debugging that might possibly make me
unable to take the machine down cleanly.  This is a disgusting and
unreliable kluge.
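
For concreteness, a minimal sketch of that kind of once-a-second sync job,
in Python.  The function name and its parameters are hypothetical; os.sync()
is the standard-library wrapper around sync(2) and is only available on
Unix-like systems:

```python
import os
import time


def sync_periodically(interval_seconds=1.0, iterations=None):
    """Call sync() every `interval_seconds`; iterations=None loops forever."""
    count = 0
    while iterations is None or count < iterations:
        os.sync()                     # ask the kernel to flush dirty buffers
        count += 1
        time.sleep(interval_seconds)
    return count
```

This only shrinks the window in which unflushed data can be lost; it does
not close it, which is why it remains a workaround and not a fix.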

    Only ext3 (some modes) + turning off your harddisk's cache can ensure
    this, at the moment.

Or ext3 (some modes) + assuming that the disk will at least write data
that's been sent to it, even if the CPU gets reset.  (I know it's
hopeless if power fails, but that can be made arbitrarily unlikely,
compared to a kernel panic or having to do a CPU reset.)

    > to have that logfile in it (instead of zero bytes).  Is this what
    > you're talking about when you say "*old* data"?  I think so, and that
    > seems to match your comment below about file tails moving around
    > rapidly.

    appending to logfiles will result in a lot of movement. with other,
    strictly block-based filesystems this occurs relatively frequent, and data
    will not usually move around. with reiserfs tail movement is frequent.

Right.

    > Wouldn't it make more sense to commit metadata to disk -after- the
    > data blocks are written?

    The problem is that there is currently no easy way to achieve that.

Why not?  (Ignore the disk-caching issue and concentrate on when the
kernel asks for data to be written to the disk.  I am -assuming- that
the kernel either (a) writes the data in the order requested, or at
least (b) once it decides to write anything, keeps sending it to the
disk until its queue is completely empty.)

    > file simply looks like the data was never added.  If the metadata is
    > written -first-, the file can scoop up random trash from elsewhere in

    Also, this is not a matter of metadata first or last. Sometimes you need
    metadata first, sometimes you need it last. And in many cases, "metadata"
    does not need to change, while data still changes.

I'm using "metadata" here as a shorthand for "how the filesystem knows
which byte on disk corresponds to which byte in the file", not just
things like atime, ctime, etc.

    > the filesystem.  I contend that this is -much- worse, because it can
    > render a previously-good file completely unparseable by tools that
    > expect that -all- of the file is in a particular syntax.

    It depends - with ext2 you frequently have garbled files, too. Basically, if
    you write to a file and turn off the power the outcome is unexpected, and
    will always be (unless you are ready to take the big speed hit).

    > Unfortunately, this behavior meant that X -did- fall over, because my
    > XF86Config file was trashed by being scrambled---I'd recently written
    > out a new version, after all---and the trashed copy no longer made any

    But the same thing can and does happen with ext2, depending on your editor
    and your timing. It is not a reiserfs thing.

Well, I've gotten several pieces of private mail from people
complaining that it's happening a lot more with reiserfs.  And
I've never been bitten this way in years of ext2 usage.

    > But if you write the metadata first, you foil this attempt to be safe,
    > because you might have this sequence at the actual disk:  [magnetic
    > oxide updated w/rename][start updating magnetic oxide with tempfile
    > data][power failure or reset]---ooops! original file gone, new file
    > doesn't have its data yet, so sorry, thanks for playing.

    Of course. If you want data to hit the disk, you have to use fsync. This
    does work with reiserfs and will ensure that the data hits the disk. If
    you don't do this then bad things might happen.

It's that I either want the data to hit the disk, or -not- to hit
the disk, but not to partially-update files such that things are
inconsistent even when the disk has been idle for 20 seconds
and the system isn't doing anything else.  It's even worse in
that the filesystem -believes- itself to be accurate, even though
the data it's actually storing is scrambled.

    > By writing metadata first, it seems that reiserfs violates the
    > idempotence of many filesystem operations, and does exactly the
    > opposite of what "journalling" implies to anyone who understands
    > databases, namely that either the operation completes entirely, or it
    > is completely undone.

    You are confusing databases with filesystems, however. Most journaling
    filesystems work that way. Some (like ext3) are nice enough to let you
    choose.

I am deliberately talking about databases, because the terminology and
technology of journalling came from there.  Using the term "journalling"
and then behaving very differently from the way it's used in database
design is misleading at best.  Several people who've written to me
have said they felt "cheated" to discover that reiserfs didn't
actually journal the data or otherwise misbehaved in ways similar
to my problem here.

* Re: ReiserFS data corruption in very simple configuration
  2001-09-26 14:43     ` Stephen C. Tweedie
@ 2001-10-01  3:38       ` Mike Fedyk
  2001-10-03 16:14         ` Stephen C. Tweedie
  0 siblings, 1 reply; 24+ messages in thread
From: Mike Fedyk @ 2001-10-01  3:38 UTC (permalink / raw)
  To: linux-kernel

Hi,

On Wed, Sep 26, 2001 at 03:43:11PM +0100, Stephen C. Tweedie wrote:
> On Tue, Sep 25, 2001 at 01:13:04PM -0700, Mike Fedyk wrote:
> > If you have data journaling, does that mean there is a possibility of
> > recovering a complete file -before- it was written?  i.e:
> 
> > echo a > test;
> > sync;
> > cat picture.tif > test
> > (writing in progress, only partially in journal)
> > power off
>  
> > Will "a" be in test upon recovery?
> 
> If you are using full data journaling (ext3's "journal" data mode) or
> the default "ordered" data mode, then no, you never see such
> behaviour.
>

At this point, it looks like I'm going to get a partial picture.tif in test
after recovery...

> In the ordered mode, it achieves this precisely because it is keeping
> a record of which blocks have been committed (or, more accurately,
> which *deleted* blocks have had the delete committed).  If you do a
> "cat > file", then before the new data is written, the file gets
> truncated and all its old data blocks deleted.  ext3 will then refuse
> to reuse those blocks until the delete has been committed, so if we
> crash and end up rolling back the delete transaction, we'll never see
> new data blocks in the old file.
>

Now, it looks like I'll end up with "a" in test...  

From what you're describing, it looks like the contents of test after a
truncate won't be overwritten by another transaction until the deletion of
those blocks has made it to disk...  So, while in ordered, or journal mode,
I'd end up with "a" in test, but with writeback mode there is no such
guarantee.
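
The rule described above (blocks freed by a delete are not reused until the
delete has been committed) can be sketched as a toy allocator.  This is an
illustration in Python only, not ext3's actual code; the class and method
names are hypothetical:

```python
class OrderedAllocator:
    """Toy block allocator: freed blocks are reusable only after commit()."""

    def __init__(self, nblocks):
        self.free = set(range(nblocks))   # blocks that may be handed out
        self.pending_free = set()         # deleted, delete not yet committed

    def allocate(self):
        if not self.free:
            raise RuntimeError("no committed-free blocks available")
        block = min(self.free)
        self.free.remove(block)
        return block

    def release(self, block):
        # Park the block instead of making it immediately reusable, so a
        # rolled-back delete never finds new data in the old file's blocks.
        self.pending_free.add(block)

    def commit(self):
        # Once the delete is committed, the blocks may be handed out again.
        self.free |= self.pending_free
        self.pending_free.clear()
```

In this model a block released by truncating test cannot be handed to the
new data until commit() has run, which is the property that lets a
rolled-back delete still show "a".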

Am I missing something?

Are there any known cases where ext3 will not be able to recover previous
data when a write wasn't able to complete?

> Cheers,
>  Stephen

Mike

* Re: ReiserFS data corruption in very simple configuration
  2001-09-29  4:44       ` Lenny Foner
  2001-09-29 12:52         ` [reiserfs-list] " Lehmann 
@ 2001-10-01 11:30         ` Stephen C. Tweedie
  1 sibling, 0 replies; 24+ messages in thread
From: Stephen C. Tweedie @ 2001-10-01 11:30 UTC (permalink / raw)
  To: Lenny Foner; +Cc: sct, linux-kernel, reiserfs-list

Hi,

On Sat, Sep 29, 2001 at 12:44:59AM -0400, Lenny Foner wrote:

>     Not true.  ext2, ext3 in its "data=writeback" mode, and reiserfs can
>     all demonstrate this behaviour.  Reiserfs is being no worse than ext2
>     (the timings may make the race more or less likely in reiserfs, but
>     ext2 _is_ vulnerable.)
> 
> ext2fs can write parts of file A to file B, and vice versa, and this
> isn't fixed by fsck?

No, we're not talking about incorrect writes, but *incomplete* writes,
which is a totally different thing.  An ext2 write of new data
involves many steps: the inode needs to be written to mark the file's
new size, the indirect mapping block[s] may have to be written to
record where the data is, and the data blocks themselves need to be
written.

Not only that, but a delete also requires multiple writes.  If you
delete a file and rapidly create a new one, then the image of the
filesystem in cache remains totally consistent, but the copy on disk
is updated incrementally and if you crash before the entire image is
updated, you can end up seeing both bits of the old file that was in
the process of being deleted, and the new file that was being created.

In addition, journaling prevents metadata inconsistencies from
occurring due to incomplete writes, but on its own, metadata journaling
doesn't mean that the data blocks are also in sync --- the disk blocks
describing a new file might be on disk, but the data blocks that the
file contains might not be.  Reiserfs, and also ext3 in its fastest
"writeback" mode, both behave like this (but ext3's other modes order
data writes so that this situation never happens: data blocks are
always flushed to disk before the metadata is committed.)
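
The difference between the two behaviours can be shown with a toy model,
sketched here in Python.  It is not filesystem code: the block number, the
strings, and the function name are all made up for illustration, and the
only thing modelled is whether data is flushed before the metadata commit:

```python
def crash_after_commit(mode):
    """Return what a reader sees if the machine crashes right after commit.

    Toy model: `disk` maps block number -> contents, and `file_blocks`
    is the on-disk metadata listing which blocks belong to the file.
    """
    disk = {7: "stale bytes from a deleted file"}
    file_blocks = []                          # on disk, the file is empty
    pending_data = {7: "new login record"}    # dirty data, still only in RAM

    if mode == "ordered":
        # Data blocks are flushed before the metadata is committed.
        disk.update(pending_data)
    elif mode != "writeback":
        raise ValueError(mode)

    # Metadata commit: the file now claims block 7.
    file_blocks.append(7)
    # Crash here: anything still only in RAM is lost.
    return [disk[block] for block in file_blocks]
```

In "writeback" mode the committed metadata points at a block whose new
contents never reached the disk, so the reader sees whatever was there
before; in "ordered" mode the data is already in place when the metadata
commit happens.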

>     e2fsck only restores metadata consistency on ext2 after a crash: it
>     can't possibly guarantee that all the data blocks have been written.
> 
> But what about written to the wrong files?  See below.

See above.  If all the metadata is intact, how can e2fsck *possibly*
detect whether a data block contains the old or the new contents of
the block?

> Let's take this scenario:  Files A and B have had blocks written to
> them sometime in the recent past (30 to 60 seconds or so) and a sync
> has not happened yet.  (I don't know how often reiserfs will be synced
> by default; 60 seconds?  Longer?  Presumably running "sync" will force
> it, but I don't know when else it will happen.)  File A may have been
> completely rewritten or newly written (e.g., what Emacs does when it
> saves a file), whereas file B may have simply been appended to (e.g.,
> what happens when wtmp is updated).
> 
> The CPU reset button is then pushed.  [See P.P.S. at end of this message.]
> 
> Now, we have the following possibilities for the outcome after the
> system comes back up and has finished checking its filesystem:
> 
> (a) Metadata correctly written, file data correctly written.
> (b) Metadata correctly written, file data partially written.
>     (E.g., one or both files might have been partially or completely
>     updated.) 
> (c) Metadata correctly written, file data completely unwritten.
>     (Neither file got updated at all.)
> (d) Metadata correctly written, FILE DATA INTERCHANGED BETWEEN A AND B.
>     (E.g., File A gets some of file B written somewhere within it,
>     and file B gets some of file A written somewhere within it---this
>     is the behavior I observed, at least twice, with reiserfs.)
> (e) Metadata corrupted in some fashion, file data undefined.
>     ("Undefined" means could be any of (a) through (d) above; I don't care.)
> 
> Now, which filesystems can show each outcome?  I don't know.  I
> contend that reiserfs does (d).  Stephen Tweedie talks above about
> whether we can "guarantee that all the data blocks have been written",
> but may be missing the point I was making, namely that THE BLOCKS HAVE
> BEEN WRITTEN TO THE WRONG FILES.

For ext3, (d) will never happen in this case.  You can only get
"wrong" data blocks if one of the files is being *deleted*, and its
blocks have been allocated to a new file, and the handover of those
blocks is incomplete at the time of the crash.

ext3 will only give you (a) (both metadata and data correctly written)
or (f) (neither have yet been written at all) if it is running in
ordered or data-journaling mode.  (b) and (c) are possible only if you
are in writeback mode.  (d) and (e) never happen if you're creating
two files, although in writeback mode (d) is possible if, say, you are
deleting A and writing B at the same time (the other ext3 modes
prevent this scenario too.)

Cheers,
 Stephen

* Re: ReiserFS data corruption in very simple configuration
  2001-09-22 10:00 ReiserFS data corruption in very simple configuration foner-reiserfs
  2001-09-22 12:47 ` Nikita Danilov
@ 2001-10-01 15:27 ` Hans Reiser
  2001-10-03 16:17   ` Stephen C. Tweedie
  1 sibling, 1 reply; 24+ messages in thread
From: Hans Reiser @ 2001-10-01 15:27 UTC (permalink / raw)
  To: foner-reiserfs; +Cc: linux-kernel

This is the meaning of metadata journaling: that writes in progress at the time
of the crash may write garbage, but you won't need to fsck.  You can get this
behaviour with other filesystems like FFS also.  If you cannot accept those
terms of service, you might use ext3 with data journaling on, but then your
performance will be far worse.  It is a tradeoff, not a bug.  Regarding where to
email these types of reiserfs questions, you might email
reiserfs-list@namesys.com with such questions, or try
www.namesys.com/support.html if you want paid support service on it.

Best,

Hans

foner-reiserfs@media.mit.edu wrote:
> 
> [Please CC me on any replies; I'm not on linux-kernel.]
> 
> The ReiserFS that comes with both Mandrake 7.2 and 8.0 has
> demonstrated a serious data corruption problem, and I'd like
> to know (a) if anyone else has seen this, (b) how to avoid it,
> and (c) how to determine how badly I've been bitten.
> 
> My configuration in each case has been an AMD CPU running ReiserFS
> exactly as configured "out of the box" by running the Mandrake 7.2 or
> 8.0 installation CD and opting to run ReiserFS instead of the default.
> This is a uniprocessor machine with one IDE 80GB Maxtor disk---no RAID
> or anything fancy like that.  The hardware itself is rock solid and
> has never demonstrated any faults at all.  (MDK 8.0 appears to use
> RSFS 3.6.25; I'm not longer running MDK 7.2, so I can't check that.)
> The machine had barely been used before each corruption problem; I'm
> not running some strange root-priv stuff, and each time, the FS hadn't
> had more than a few minutes to a few hours of use since being created.
> 
> In each case, I've gotten in trouble by editing my XF86Config-4 file,
> guessing wrong on a modeline, hanging X (blank gray screen & no
> response to anything), and being forced to hit the reset button
> because nothing else worked.  Under 7.2, I discovered that my
> XF86Config-4 file suddenly had a block of nulls in it.  That time, I
> thought I must have been hallucinating, but I ran a background job to
> sync the filesystem every second while continuing to debug the X
> problems, and didn't see the corruption again.
> 
> Now, I was just bitten by the -same- behavior under MDK 8.0.  After
> accidentally hanging X, I waited a few seconds just in case a sync was
> pending, hit reset, and had all sorts of lossage:
>   (1) Parts of the XF86Conf-4 file had lines garbled, e.g.,
>       sections of the file had apparently been rearranged.
>   (2) /var/log/XFree86.0.log was truncated, and maybe garbled.
>   (2) Logging in as root was fine, but then logging in as myself
>       I got "Last login: <4-5 lines of my XFree86.0.log file (!)>"
>       instead of a date!  Logging in again gave me the proper
>       last-login time, but clearly wtmp or something else had
>       gotten stepped on in some weird way.
> Obviously, the behavior I saw once under MDK 7.2 was no hallucination
> or accidental yank in Emacs.
> 
> I thought the whole point of a journalling file system was to
> -prevent- corruption due to an unexpected failure!  This seems to be
> -far- worse than a normal filesystem---ext2fs would at least choke and
> force fsck to be run, which might actually fix the problem, but this
> is ridiculous---it just silently trashes random files.
> 
> So I now have possibly-undetected filesystem damage.  My -guess- is
> that only files written within a few minutes of the reset are likely
> to be affected, but I really don't know, and don't know of a good way
> to find out.  Must I reinstall the OS -again-, starting from a blank
> partition, to be sure?  Maybe I should just give up on ReiserFS completely.
> 
> [If there is a more-appropriate place for me to send this---such as
> a particular Mandrake list, or a particular ReiserFS list---please let
> me know, particularly if I can get a quick answer -without- going
> through the overhead of subscribing to the list, being flooded, and
> unsubscribing---that's what archives are for.  Some websearching
> for "ReiserFS corruption" yielded -thousands- of hits---not a good
> sign---and a very large proportion of them were on this list, so I
> figure this is as good a place to ask as any.  Thanks again.]


* Re: ReiserFS data corruption in very simple configuration
  2001-10-01  3:38       ` Mike Fedyk
@ 2001-10-03 16:14         ` Stephen C. Tweedie
  0 siblings, 0 replies; 24+ messages in thread
From: Stephen C. Tweedie @ 2001-10-03 16:14 UTC (permalink / raw)
  To: linux-kernel

Hi,

On Sun, Sep 30, 2001 at 08:38:31PM -0700, Mike Fedyk wrote:
 
> From what you're describing, it looks like the contents of test after a
> truncate won't be overwritten by another transaction until the deletion of
> those blocks has made it to disk...  So, while in ordered, or journal mode,
> I'd end up with "a" in test, but with writeback mode there is no such
> guarantee.
> 
> Am I missing something?
> 
> Are there any known cases where ext3 will not be able to recover previous
> data when a write wasn't able to complete?

It depends on what the application is doing.  Applications often open
an existing file with O_TRUNC, write to it, then close it.  If you
crash between the truncate and the write being committed, then you'll
get a perfectly legal, sane, consistent, empty file on recovery.

--Stephen
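
[Annotation] One common way for an application to avoid the truncate-then-write window described above is to write the new contents to a temporary file and rename it over the old one. This technique is not proposed anywhere in the thread; it is a minimal sketch, the helper name `replace_file_contents` is hypothetical, and what actually survives a crash still depends on the filesystem and its mount options.

```python
import os
import tempfile


def replace_file_contents(path, data):
    """Replace `path` with `data` without truncating the existing file in place."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target,
    # otherwise the rename below cannot be a single directory operation.
    # Note: mkstemp creates the file with mode 0600, so the original
    # permissions are not preserved by this sketch.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # ask for the new data to reach the disk first
        os.replace(tmp_path, path)  # then swap the name over to the new file
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Also sync the directory so the rename itself is requested to be durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

With this pattern a reader of `path` sees either the old contents or the new ones, rather than the legal but empty file that O_TRUNC can leave behind.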


* Re: ReiserFS data corruption in very simple configuration
  2001-10-01 15:27 ` Hans Reiser
@ 2001-10-03 16:17   ` Stephen C. Tweedie
  2001-10-03 20:06     ` Pascal Schmidt
  0 siblings, 1 reply; 24+ messages in thread
From: Stephen C. Tweedie @ 2001-10-03 16:17 UTC (permalink / raw)
  To: Hans Reiser; +Cc: foner-reiserfs, linux-kernel, Stephen Tweedie

Hi,

On Mon, Oct 01, 2001 at 07:27:31PM +0400, Hans Reiser wrote:
> This is the meaning of metadata journaling: that writes in progress at the time
> of the crash may write garbage, but you won't need to fsck.  You can get this
> behaviour with other filesystems like FFS also.  If you cannot accept those
> terms of service, you might use ext3 with data journaling on, but then your
> performance will be far worse.

ext3 with ordered data writes has performance nearly up to the level
of the fast-and-loose writeback mode for most workloads, and still
avoids ever exposing stale disk blocks after a crash.

Sure, it's a tradeoff, but there are positions between the two
extremes (totally unordered data writes, and totally journaled data
writes) which offer a good compromise here.

Cheers,
 Stephen


* Re: [reiserfs-list] ReiserFS data corruption in very simple configuration
  2001-10-01  1:26             ` Lehmann 
  2001-10-01  2:32               ` foner-reiserfs
@ 2001-10-03 16:28               ` Toby Dickenson
  1 sibling, 0 replies; 24+ messages in thread
From: Toby Dickenson @ 2001-10-03 16:28 UTC (permalink / raw)
  To: pcg; +Cc: foner-reiserfs, sct, Nikita, Mason, linux-kernel, reiserfs-list

>Of course. If you want data to hit the disk, you have to use fsync. This
>does work with reiserfs and will ensure that the data hits the disk. If
>you don't do this then bad things might happen.

This is probably a naive question, but this thread has already proved
me wrong on one naive assumption.....

If the sequence is:
1. append some data to file A
2. fsync(A)
3. append some further data to A
4. some writes to other files
5. power loss

Is it guaranteed that all the data written in step 1 will still be
intact?

The potential problem I can see is that some data from step 1 may have
been written in a tail, the tail moves during step 3, and then the
original tail is overwritten before the new tail (including data from
before the fsync) is safely on disk.

Thanks for your help,


Toby Dickenson
tdickenson@geminidataloggers.com
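
[Annotation] For concreteness, steps 1 to 3 of the sequence above look roughly like this from an application's point of view. It is only a sketch of the calls involved: the function name `append_with_fsync` is hypothetical, and the code does not answer the question of what reiserfs guarantees about the step 1 data after a power loss.

```python
import os


def append_with_fsync(path, first, second):
    """Append `first`, fsync, then append `second` without a second fsync."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, first)   # step 1: append some data to file A
        os.fsync(fd)          # step 2: fsync(A); returns once the kernel reports it done
        os.write(fd, second)  # step 3: further data, left unsynced on purpose
        # A production version would check the byte counts returned by
        # os.write(); small writes to a regular file are assumed complete here.
    finally:
        os.close(fd)
```

Steps 4 and 5 (writes to other files, then power loss) cannot be reproduced in a snippet; the point is only that the second append is issued after the fsync has returned.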


* Re: ReiserFS data corruption in very simple configuration
  2001-10-03 16:17   ` Stephen C. Tweedie
@ 2001-10-03 20:06     ` Pascal Schmidt
  2001-10-04 11:02       ` Stephen C. Tweedie
  0 siblings, 1 reply; 24+ messages in thread
From: Pascal Schmidt @ 2001-10-03 20:06 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: linux-kernel

On Wed, 3 Oct 2001, Stephen C. Tweedie wrote:

> ext3 with ordered data writes has performance nearly up to the level
> of the fast-and-loose writeback mode for most workloads, and still
> avoids ever exposing stale disk blocks after a crash.
What if the machine crashes with parts of the data blocks written to
disk, before the commit block gets submitted to the drive?

The journal will tell us that the write transaction hasn't finished, but
that doesn't mean that no data blocks made it to disk, right? We won't
expose stale disk blocks, right, but there is still a mix between new and
old file data in this situation. I assume e2fsck will warn about this?

-- 
Ciao, Pascal

-<[ pharao90@tzi.de, netmail 2:241/215.72, home http://cobol.cjb.net/) ]>-



* Re: ReiserFS data corruption in very simple configuration
  2001-10-03 20:06     ` Pascal Schmidt
@ 2001-10-04 11:02       ` Stephen C. Tweedie
  0 siblings, 0 replies; 24+ messages in thread
From: Stephen C. Tweedie @ 2001-10-04 11:02 UTC (permalink / raw)
  To: Pascal Schmidt; +Cc: Stephen C. Tweedie, linux-kernel

Hi,

On Wed, Oct 03, 2001 at 10:06:58PM +0200, Pascal Schmidt wrote:
> On Wed, 3 Oct 2001, Stephen C. Tweedie wrote:
> 
> > ext3 with ordered data writes has performance nearly up to the level
> > of the fast-and-loose writeback mode for most workloads, and still
> > avoids ever exposing stale disk blocks after a crash.
> What if the machine crashes with parts of the data blocks written to
> disk, before the commit block gets submitted to the drive?

In most cases, users write data by extending off the end of a file.
Only in a few cases (such as databases) do you ever write into the
middle of an existing file.  Even overwriting an existing file is done
by first truncating the file and then extending it again.

If you crash during such an extend, then the data blocks may have been
partially written, but the extend will not have been, so the
incompletely-written data blocks will not be part of any file.

The *only* way to get mis-ordered data blocks in ordered mode after a
crash is if you are overwriting in the middle of an existing file.  In
such a case there is no absolute guarantee about write ordering unless
you use fsync() or O_SYNC to force writes in a particular order.  

In journaled data mode, even mid-file overwrites will be strictly
ordered after a crash.

Cheers,
 Stephen
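
[Annotation] The mid-file overwrite case mentioned above, where ordering matters and fsync() is used to force it, can be sketched as follows. The helper name `ordered_overwrites` is hypothetical; opening the file with `os.O_SYNC` instead of calling `os.fsync` after each write is the other option the reply names.

```python
import os


def ordered_overwrites(path, updates):
    """Apply (offset, data) overwrites one at a time, syncing after each.

    Each os.fsync() call returns only after the kernel reports the preceding
    write as flushed, so a later update is never issued before an earlier one
    has been synced.
    """
    fd = os.open(path, os.O_WRONLY)
    try:
        for offset, data in updates:
            os.pwrite(fd, data, offset)  # overwrite in the middle of the file
            os.fsync(fd)                 # force this write out before the next
    finally:
        os.close(fd)
```

This trades throughput for ordering: every update pays for a full flush, which is the cost the ordered and journaled data modes try to avoid for the common append-only case.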


* Re: [reiserfs-list] Re: ReiserFS data corruption in very simple configuration
  2001-09-24  9:25   ` [reiserfs-list] " Jens Benecke
@ 2001-10-14 14:52     ` Chris Mason
  2001-10-14 18:19       ` Jens Benecke
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Mason @ 2001-10-14 14:52 UTC (permalink / raw)
  To: Jens Benecke, linux-kernel, Reiserfs mail-list



On Monday, September 24, 2001 11:25:10 AM +0200 Jens Benecke
<jens@jensbenecke.de> wrote:

> one question:
> 
> When I was using ext2 I always mounted the /usr partition read-only, so
> that a fsck weren't necessary at boot - and the files were all guaranteed
> to be OK to bring the system up at least.
> 
> Does this (mount -o ro) make sense with ReiserFS as well? What I mean is,
> is there a chance of a file getting corrupted that was only *read* (not
> *written*) at or before a power outage?

Yes, after the mount is finished, reiserfs won't change the files on a
readonly mount.

-chris
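
[Annotation] Whether a given path really sits on a read-only mount can be checked from user space. This is a small sketch using `os.statvfs`; the function name `is_mounted_readonly` is hypothetical, and the check says nothing about journal replay that may happen while the mount is being set up.

```python
import os


def is_mounted_readonly(path):
    """Return True if the filesystem containing `path` is mounted read-only."""
    flags = os.statvfs(path).f_flag
    return bool(flags & os.ST_RDONLY)


if __name__ == "__main__":
    print("/usr read-only:", is_mounted_readonly("/usr"))
```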



* Re: [reiserfs-list] Re: ReiserFS data corruption in very simple configuration
  2001-10-14 14:52     ` Chris Mason
@ 2001-10-14 18:19       ` Jens Benecke
  2001-10-14 20:04         ` Hans Reiser
  2001-10-14 23:32         ` Bernd Eckenfels
  0 siblings, 2 replies; 24+ messages in thread
From: Jens Benecke @ 2001-10-14 18:19 UTC (permalink / raw)
  To: linux-kernel, Reiserfs mail-list


On Sun, Oct 14, 2001 at 10:52:54AM -0400, Chris Mason wrote:
 
> > When I was using ext2 I always mounted the /usr partition read-only, so
> > that a fsck weren't necessary at boot - and the files were all
> > guaranteed to be OK to bring the system up at least.
> > 
> > Does this (mount -o ro) make sense with ReiserFS as well? What I mean
> > is, is there a chance of a file getting corrupted that was only *read*
> > (not *written*) at or before a power outage?
> 
> Yes, after the mount is finished, reiserfs won't change the files on a
> readonly mount.

What I meant is this: AFAIK, if you exclude broken hardware, in ext2 there
is no chance of a file that was never written to since mounting being
corrupted on a crash, even if the fs was mounted read-write.

Is this the same thing with ReiserFS?


-- 
Jens Benecke ········ http://www.hitchhikers.de/ - Europas Mitfahrzentrale

Crypto regulations will only hinder criminals who obey the law.



* Re: [reiserfs-list] Re: ReiserFS data corruption in very simple configuration
  2001-10-14 18:19       ` Jens Benecke
@ 2001-10-14 20:04         ` Hans Reiser
  2001-10-14 23:32         ` Bernd Eckenfels
  1 sibling, 0 replies; 24+ messages in thread
From: Hans Reiser @ 2001-10-14 20:04 UTC (permalink / raw)
  To: Jens Benecke; +Cc: linux-kernel, Reiserfs mail-list

Jens Benecke wrote:
> What I meant is this: AFAIK, if you exclude broken hardware, in ext2 there
> is no chance of a file that was never written to since mounting being
> corrupted on a crash, even if the fs was mounted read-write.
> 
> Is this the same thing with ReiserFS?

Yes.


* Re: [reiserfs-list] Re: ReiserFS data corruption in very simple configuration
  2001-10-14 18:19       ` Jens Benecke
  2001-10-14 20:04         ` Hans Reiser
@ 2001-10-14 23:32         ` Bernd Eckenfels
  1 sibling, 0 replies; 24+ messages in thread
From: Bernd Eckenfels @ 2001-10-14 23:32 UTC (permalink / raw)
  To: linux-kernel

In article <20011014201907.H20001@jensbenecke.de> you wrote:
> What I meant is this: AFAIK, if you exclude broken hardware, in ext2 there
> is no chance of a file that was never written to since mounting being
> corrupted on a crash

Well, you can either lose the file due to a broken directory (maybe you
find the missing inode in lost+found) or the file can even be corrupted by
an ext2 software error, which is unlikely, but all filesystems in development
are reported to eat files every now and then.

Greetings
Bernd


Thread overview: 24+ messages
2001-09-22 10:00 ReiserFS data corruption in very simple configuration foner-reiserfs
2001-09-22 12:47 ` Nikita Danilov
2001-09-22 20:44   ` foner-reiserfs
2001-09-25 13:28     ` Stephen C. Tweedie
2001-09-29  4:44       ` Lenny Foner
2001-09-29 12:52         ` [reiserfs-list] " Lehmann 
2001-10-01  1:00           ` foner-reiserfs
2001-10-01  1:26             ` Lehmann 
2001-10-01  2:32               ` foner-reiserfs
2001-10-03 16:28               ` Toby Dickenson
2001-10-01 11:30         ` Stephen C. Tweedie
2001-09-24  9:25   ` [reiserfs-list] " Jens Benecke
2001-10-14 14:52     ` Chris Mason
2001-10-14 18:19       ` Jens Benecke
2001-10-14 20:04         ` Hans Reiser
2001-10-14 23:32         ` Bernd Eckenfels
2001-09-25 20:13   ` Mike Fedyk
2001-09-26 14:43     ` Stephen C. Tweedie
2001-10-01  3:38       ` Mike Fedyk
2001-10-03 16:14         ` Stephen C. Tweedie
2001-10-01 15:27 ` Hans Reiser
2001-10-03 16:17   ` Stephen C. Tweedie
2001-10-03 20:06     ` Pascal Schmidt
2001-10-04 11:02       ` Stephen C. Tweedie
