* ReiserFS data corruption in very simple configuration
@ 2001-09-22 10:00 foner-reiserfs
2001-09-22 12:47 ` Nikita Danilov
2001-10-01 15:27 ` Hans Reiser
0 siblings, 2 replies; 24+ messages in thread
From: foner-reiserfs @ 2001-09-22 10:00 UTC (permalink / raw)
To: linux-kernel; +Cc: foner-reiserfs
[Please CC me on any replies; I'm not on linux-kernel.]
The ReiserFS that comes with both Mandrake 7.2 and 8.0 has
demonstrated a serious data corruption problem, and I'd like
to know (a) if anyone else has seen this, (b) how to avoid it,
and (c) how to determine how badly I've been bitten.
My configuration in each case has been an AMD CPU running ReiserFS
exactly as configured "out of the box" by running the Mandrake 7.2 or
8.0 installation CD and opting to run ReiserFS instead of the default.
This is a uniprocessor machine with one IDE 80GB Maxtor disk---no RAID
or anything fancy like that. The hardware itself is rock solid and
has never demonstrated any faults at all. (MDK 8.0 appears to use
RSFS 3.6.25; I'm no longer running MDK 7.2, so I can't check that.)
The machine had barely been used before each corruption problem; I'm
not running some strange root-priv stuff, and each time, the FS hadn't
had more than a few minutes to a few hours of use since being created.
In each case, I've gotten in trouble by editing my XF86Config-4 file,
guessing wrong on a modeline, hanging X (blank gray screen & no
response to anything), and being forced to hit the reset button
because nothing else worked. Under 7.2, I discovered that my
XF86Config-4 file suddenly had a block of nulls in it. That time, I
thought I must have been hallucinating, but I ran a background job to
sync the filesystem every second while continuing to debug the X
problems, and didn't see the corruption again.
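
A periodic sync of the kind described above can be sketched as follows. The message does not say what command was actually used, so the function name `sync_periodically` and its parameters are made up for illustration. A job like this only shortens the window during which written data sits unflushed; it does not change what the filesystem guarantees after a crash.

```python
import os
import time


def sync_periodically(interval_seconds=1.0, iterations=None):
    """Call sync(2) every `interval_seconds` seconds.

    Stops after `iterations` calls; None means run until interrupted.
    Returns the number of syncs issued. (Hypothetical helper, Unix only.)
    """
    count = 0
    while iterations is None or count < iterations:
        os.sync()  # ask the kernel to flush dirty buffers to disk
        count += 1
        time.sleep(interval_seconds)
    return count
```

Calling `sync_periodically(1.0)` from a background process would approximate the once-per-second job mentioned in the text.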
Now, I was just bitten by the -same- behavior under MDK 8.0. After
accidentally hanging X, I waited a few seconds just in case a sync was
pending, hit reset, and had all sorts of lossage:
(1) Parts of the XF86Config-4 file had lines garbled, e.g.,
sections of the file had apparently been rearranged.
(2) /var/log/XFree86.0.log was truncated, and maybe garbled.
(3) Logging in as root was fine, but then logging in as myself
I got "Last login: <4-5 lines of my XFree86.0.log file (!)>"
instead of a date! Logging in again gave me the proper
last-login time, but clearly wtmp or something else had
gotten stepped on in some weird way.
Obviously, the behavior I saw once under MDK 7.2 was no hallucination
or accidental yank in Emacs.
I thought the whole point of a journalling file system was to
-prevent- corruption due to an unexpected failure! This seems to be
-far- worse than a normal filesystem---ext2fs would at least choke and
force fsck to be run, which might actually fix the problem, but this
is ridiculous---it just silently trashes random files.
So I now have possibly-undetected filesystem damage. My -guess- is
that only files written within a few minutes of the reset are likely
to be affected, but I really don't know, and don't know of a good way
to find out. Must I reinstall the OS -again-, starting from a blank
partition, to be sure? Maybe I should just give up on ReiserFS completely.
[If there is a more-appropriate place for me to send this---such as
a particular Mandrake list, or a particular ReiserFS list---please let
me know, particularly if I can get a quick answer -without- going
through the overhead of subscribing to the list, being flooded, and
unsubscribing---that's what archives are for. Some websearching
for "ReiserFS corruption" yielded -thousands- of hits---not a good
sign---and a very large proportion of them were on this list, so I
figure this is as good a place to ask as any. Thanks again.]
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: ReiserFS data corruption in very simple configuration
2001-09-22 10:00 ReiserFS data corruption in very simple configuration foner-reiserfs
@ 2001-09-22 12:47 ` Nikita Danilov
2001-09-22 20:44 ` foner-reiserfs
` (2 more replies)
2001-10-01 15:27 ` Hans Reiser
1 sibling, 3 replies; 24+ messages in thread
From: Nikita Danilov @ 2001-09-22 12:47 UTC (permalink / raw)
To: foner-reiserfs; +Cc: linux-kernel, Reiserfs mail-list

foner-reiserfs@media.mit.edu writes:
> [Please CC me on any replies; I'm not on linux-kernel.]
>
> The ReiserFS that comes with both Mandrake 7.2 and 8.0 has
> demonstrated a serious data corruption problem, and I'd like
> to know (a) if anyone else has seen this, (b) how to avoid it,
> and (c) how to determine how badly I've been bitten.
>
> My configuration in each case has been an AMD CPU running ReiserFS
> exactly as configured "out of the box" by running the Mandrake 7.2 or
> 8.0 installation CD and opting to run ReiserFS instead of the default.
> This is a uniprocessor machine with one IDE 80GB Maxtor disk---no RAID
> or anything fancy like that. The hardware itself is rock solid and
> has never demonstrated any faults at all. (MDK 8.0 appears to use
> RSFS 3.6.25; I'm no longer running MDK 7.2, so I can't check that.)
> The machine had barely been used before each corruption problem; I'm
> not running some strange root-priv stuff, and each time, the FS hadn't
> had more than a few minutes to a few hours of use since being created.
>
> In each case, I've gotten in trouble by editing my XF86Config-4 file,
> guessing wrong on a modeline, hanging X (blank gray screen & no
> response to anything), and being forced to hit the reset button
> because nothing else worked. Under 7.2, I discovered that my
> XF86Config-4 file suddenly had a block of nulls in it. That time, I
> thought I must have been hallucinating, but I ran a background job to
> sync the filesystem every second while continuing to debug the X
> problems, and didn't see the corruption again.
>
> Now, I was just bitten by the -same- behavior under MDK 8.0. After
> accidentally hanging X, I waited a few seconds just in case a sync was
> pending, hit reset, and had all sorts of lossage:
> (1) Parts of the XF86Config-4 file had lines garbled, e.g.,
> sections of the file had apparently been rearranged.
> (2) /var/log/XFree86.0.log was truncated, and maybe garbled.
> (3) Logging in as root was fine, but then logging in as myself
> I got "Last login: <4-5 lines of my XFree86.0.log file (!)>"
> instead of a date! Logging in again gave me the proper
> last-login time, but clearly wtmp or something else had
> gotten stepped on in some weird way.
> Obviously, the behavior I saw once under MDK 7.2 was no hallucination
> or accidental yank in Emacs.
>
> I thought the whole point of a journalling file system was to
> -prevent- corruption due to an unexpected failure! This seems to be
> -far- worse than a normal filesystem---ext2fs would at least choke and
> force fsck to be run, which might actually fix the problem, but this
> is ridiculous---it just silently trashes random files.

Stock reiserfs only provides meta-data journalling. It guarantees that
the structure of your file-system will be correct after journal replay,
not the content of files. It will never "trash" a file that wasn't
accessed at the moment of the crash, though.

Full data-journaling comes at a cost. There is a patch by Chris Mason
<Mason@Suse.COM> to support data journaling in reiserfs. Ext3 supports
it also.

>
> So I now have possibly-undetected filesystem damage. My -guess- is
> that only files written within a few minutes of the reset are likely
> to be affected, but I really don't know, and don't know of a good way
> to find out. Must I reinstall the OS -again-, starting from a blank
> partition, to be sure? Maybe I should just give up on ReiserFS completely.
>
> [If there is a more-appropriate place for me to send this---such as
> a particular Mandrake list, or a particular ReiserFS list---please let
> me know, particularly if I can get a quick answer -without- going

Reiserfs mail-list <Reiserfs-List@Namesys.COM>, archive at
http://marc.theaimsgroup.com/?l=reiserfs&r=1&w=2

> through the overhead of subscribing to the list, being flooded, and
> unsubscribing---that's what archives are for. Some websearching
> for "ReiserFS corruption" yielded -thousands- of hits---not a good
> sign---and a very large proportion of them were on this list, so I
> figure this is as good a place to ask as any. Thanks again.]

Nikita.

^ permalink raw reply [flat|nested] 24+ messages in thread
* ReiserFS data corruption in very simple configuration
2001-09-22 12:47 ` Nikita Danilov
@ 2001-09-22 20:44 ` foner-reiserfs
2001-09-25 13:28 ` Stephen C. Tweedie
2001-09-24 9:25 ` [reiserfs-list] " Jens Benecke
2001-09-25 20:13 ` Mike Fedyk
2 siblings, 1 reply; 24+ messages in thread
From: foner-reiserfs @ 2001-09-22 20:44 UTC (permalink / raw)
To: Nikita; +Cc: linux-kernel, Reiserfs-List, foner-reiserfs

    Date: Sat, 22 Sep 2001 16:47:31 +0400
    From: Nikita Danilov <Nikita@Namesys.COM>

    Stock reiserfs only provides meta-data journalling. It guarantees that
    the structure of your file-system will be correct after journal replay,
    not the content of files. It will never "trash" a file that wasn't
    accessed at the moment of the crash, though.

Thanks for clarifying this. However, I should point out that the
failure mode is quite serious---whereas ext2fs would simply fail
to record data written to a file before a sync, reiserfs seems to
have instead -swapped random pieces of one file with another-,
which is -much- harder to detect and fix.

I can live with uncommitted changes, but it's hard to justify the
behavior I saw---it means that any even slightly-busy machine that
experiences a crash could have dozens or hundreds of files with each
others' contents all mixed together---remember, parts of my XF86Config
file wound up in wtmp! And both XF86Config and wtmp had been written
at least 20 seconds before I had to push the reset button, and perhaps
> 30 seconds; I don't recall how often the FS is syncing by default,
but it's disturbing behavior. After all, at the time I pushed reset,
I had -no- files actually being written by any process under my direct
control; I'd merely written one file out from Emacs under a minute
earlier. I'd hate to think of what would happen if this sort of thing
hit a CVS repository.

This seems to outweigh the convenience of a rapid start after failure
(one of the reasons I decided to try reiserfs in the first place),
because a failure means possibly having to recover an entire file
server from backups (hence losing -lots more- data) because you don't
know which files might have been trashed if the machine loses power or
the kernel panics. There's no simple test ("did my edits make it into
the file?"), and no way to really know if the machine might later
misbehave because critical files have been overwritten with parts of
others. (This inability to easily figure out what might have been
affected also means that the damage will rapidly propagate to backups,
hence making the backups useless.) About the only way around it would
seem to be to checksum every file in the FS at regular intervals, and
rechecksum after a crash---at which point, what's the point of not
having to run fsck?

Is this -really- how reiserfs is supposed to behave in a crash? It
seems like this should be prominently documented in the description of
the file system---I know that I'm rather nervous about using it if
this is true, since it turns a few minutes of fsck'ing (for ext2fs)
into a restore-the-whole-file-system exercise instead. Surely that's
not right. If this is really supposed to be how reiserfs behaves any
time it doesn't get to sync before a machine dies on it, I can't see
how it can be justified for any production use, and I'll probably have
to reinstall my OS using ext2fs instead.

    Full data-journaling comes at a cost. There is a patch by Chris Mason
    <Mason@Suse.COM> to support data journaling in reiserfs. Ext3 supports
    it also.

Do you have a URL for this? A search for reiserfs and mason yields
12,000 hits. (I'm particularly interested in one for reiserfs 3.6.25
and Mandrake 8.0, but I assume there may be several variants in the
same repository.)

    > So I now have possibly-undetected filesystem damage. My -guess- is
    > that only files written within a few minutes of the reset are likely
    > to be affected, but I really don't know, and don't know of a good way
    > to find out. Must I reinstall the OS -again-, starting from a blank
    > partition, to be sure? Maybe I should just give up on ReiserFS completely.
    >
    > [If there is a more-appropriate place for me to send this---such as
    > a particular Mandrake list, or a particular ReiserFS list---please let
    > me know, particularly if I can get a quick answer -without- going

    Reiserfs mail-list <Reiserfs-List@Namesys.COM>, archive at
    http://marc.theaimsgroup.com/?l=reiserfs&r=1&w=2

Thanks. I saw that list before, and I'm glad that you've included it
in this discussion.

^ permalink raw reply [flat|nested] 24+ messages in thread
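
The "checksum every file at regular intervals, and rechecksum after a crash" idea raised in the message above might be sketched roughly like this. It is only an illustration: the function names are invented here, SHA-256 is an arbitrary choice, and the manifest would have to be stored somewhere that is not itself at risk from the crash. A changed digest also cannot tell corruption apart from a legitimate edit made after the manifest was taken.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each regular file under `root` (relative path) to its digest.

    Symlinks are skipped; unreadable files raise, so run with enough rights.
    """
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                manifest[os.path.relpath(path, root)] = file_digest(path)
    return manifest


def changed_files(before, after):
    """Return sorted paths whose digest differs or that exist in only one manifest."""
    paths = set(before) | set(after)
    return sorted(p for p in paths if before.get(p) != after.get(p))
```

Comparing a manifest taken before the crash with one rebuilt afterwards gives the list of files to inspect by hand.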
* Re: ReiserFS data corruption in very simple configuration
2001-09-22 20:44 ` foner-reiserfs
@ 2001-09-25 13:28 ` Stephen C. Tweedie
2001-09-29 4:44 ` Lenny Foner
0 siblings, 1 reply; 24+ messages in thread
From: Stephen C. Tweedie @ 2001-09-25 13:28 UTC (permalink / raw)
To: foner-reiserfs; +Cc: Nikita, Stephen Tweedie, linux-kernel

Hi,

On Sat, Sep 22, 2001 at 04:44:21PM -0400, foner-reiserfs@media.mit.edu wrote:
>     Stock reiserfs only provides meta-data journalling. It guarantees that
>     the structure of your file-system will be correct after journal replay,
>     not the content of files. It will never "trash" a file that wasn't
>     accessed at the moment of the crash, though.
>
> Thanks for clarifying this. However, I should point out that the
> failure mode is quite serious---whereas ext2fs would simply fail
> to record data written to a file before a sync, reiserfs seems to
> have instead -swapped random pieces of one file with another-,
> which is -much- harder to detect and fix.

Not true. ext2, ext3 in its "data=writeback" mode, and reiserfs can
all demonstrate this behaviour. Reiserfs is being no worse than ext2
(the timings may make the race more or less likely in reiserfs, but
ext2 _is_ vulnerable.)

e2fsck only restores metadata consistency on ext2 after a crash: it
can't possibly guarantee that all the data blocks have been written.

ext3 will let you do full data journaling, but also has a third mode
(the default), which doesn't journal data, but which does make sure
that data is flushed to disk before the transaction which allocated
that data is allowed to commit. That gives you most of the
performance of ext3's fast-and-loose writeback mode, but with an
absolute guarantee that you never see stale blocks in a file after a
crash.

Cheers,
Stephen

^ permalink raw reply [flat|nested] 24+ messages in thread
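
The difference between committing metadata before the data is on disk and flushing data first can be illustrated with a toy model. The sketch below is not how ext3 or reiserfs is implemented; the "disk", the block contents, and the function name `simulate_append` are all invented for illustration. It only shows why a crash between the two steps exposes stale block contents in one ordering and not in the other.

```python
def simulate_append(order, crash_after_first_step):
    """Toy model of appending one block to a file, with an optional crash.

    `order` is "metadata_first" or "data_first". The disk starts with one
    free block (number 1) that still holds stale bytes from a deleted file.
    Returns what a reader sees in the file after reboot.
    """
    # block number -> content currently on disk
    disk = {0: "wtmp-old-record", 1: "STALE-XFree86-log"}
    file_blocks = [0]  # on-disk metadata: blocks owned by the file
    new_block, new_data = 1, "wtmp-new-record"

    def write_metadata():
        file_blocks.append(new_block)  # file now claims the block

    def write_data():
        disk[new_block] = new_data  # block now holds the new data

    steps = {
        "metadata_first": [write_metadata, write_data],
        "data_first": [write_data, write_metadata],
    }[order]

    steps[0]()
    if not crash_after_first_step:
        steps[1]()
    return [disk[b] for b in file_blocks]
```

In this model, `simulate_append("metadata_first", True)` returns the old record followed by the stale log text, while `simulate_append("data_first", True)` returns only the old record: the append is lost, but nothing foreign shows up in the file.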
* ReiserFS data corruption in very simple configuration
2001-09-25 13:28 ` Stephen C. Tweedie
@ 2001-09-29 4:44 ` Lenny Foner
2001-09-29 12:52 ` [reiserfs-list] " Lehmann
2001-10-01 11:30 ` Stephen C. Tweedie
0 siblings, 2 replies; 24+ messages in thread
From: Lenny Foner @ 2001-09-29 4:44 UTC (permalink / raw)
To: sct; +Cc: Nikita, Mason, linux-kernel, reiserfs-list, foner-reiserfs

[As before, please make sure you CC me on replies or I won't see them. Tnx!]

    Date: Tue, 25 Sep 2001 14:28:54 +0100
    From: "Stephen C. Tweedie" <sct@redhat.com>

    Hi,

    On Sat, Sep 22, 2001 at 04:44:21PM -0400, foner-reiserfs@media.mit.edu wrote:
    >     Stock reiserfs only provides meta-data journalling. It guarantees that
    >     the structure of your file-system will be correct after journal replay,
    >     not the content of files. It will never "trash" a file that wasn't
    >     accessed at the moment of the crash, though.
    >
    > Thanks for clarifying this. However, I should point out that the
    > failure mode is quite serious---whereas ext2fs would simply fail
    > to record data written to a file before a sync, reiserfs seems to
    > have instead -swapped random pieces of one file with another-,
    > which is -much- harder to detect and fix.

    Not true. ext2, ext3 in its "data=writeback" mode, and reiserfs can
    all demonstrate this behaviour. Reiserfs is being no worse than ext2
    (the timings may make the race more or less likely in reiserfs, but
    ext2 _is_ vulnerable.)

ext2fs can write parts of file A to file B, and vice versa, and this
isn't fixed by fsck? [See outcome (d) below.] I'm having difficulty
believing how this can be possible for a non-journalling filesystem.

    e2fsck only restores metadata consistency on ext2 after a crash: it
    can't possibly guarantee that all the data blocks have been written.

But what about written to the wrong files? See below.

    ext3 will let you do full data journaling, but also has a third mode
    (the default), which doesn't journal data, but which does make sure
    that data is flushed to disk before the transaction which allocated
    that data is allowed to commit. That gives you most of the
    performance of ext3's fast-and-loose writeback mode, but with an
    absolute guarantee that you never see stale blocks in a file after a
    crash.

I've been getting a stream of private mail over the last few days
saying one thing or another about various filesystems with various
optional patches, so let me get this out in the open and see if we can
converge on an answer here. [ext2fs, ext3fs, and reiserfs answers
should feel free to cite which mode they're talking about and URLs for
whatever patches are required to get to that mode; some impressions
about reliability and maturity would be useful, too.]

Let's take this scenario: Files A and B have had blocks written to
them sometime in the recent past (30 to 60 seconds or so) and a sync
has not happened yet. (I don't know how often reiserfs will be synced
by default; 60 seconds? Longer? Presumably running "sync" will force
it, but I don't know when else it will happen.) File A may have been
completely rewritten or newly written (e.g., what Emacs does when it
saves a file), whereas file B may have simply been appended to (e.g.,
what happens when wtmp is updated). The CPU reset button is then
pushed. [See P.P.S. at end of this message.]

Now, we have the following possibilities for the outcome after the
system comes back up and has finished checking its filesystem:

(a) Metadata correctly written, file data correctly written.
(b) Metadata correctly written, file data partially written.
    (E.g., one or both files might have been partially or completely
    updated.)
(c) Metadata correctly written, file data completely unwritten.
    (Neither file got updated at all.)
(d) Metadata correctly written, FILE DATA INTERCHANGED BETWEEN A AND B.
    (E.g., File A gets some of file B written somewhere within it, and
    file B gets some of file A written somewhere within it---this is
    the behavior I observed, at least twice, with reiserfs.)
(e) Metadata corrupted in some fashion, file data undefined.
    ("Undefined" means could be any of (a) through (d) above; I don't care.)

Now, which filesystems can show each outcome? I don't know. I
contend that reiserfs does (d). Stephen Tweedie talks above about
whether we can "guarantee that all the data blocks have been written",
but may be missing the point I was making, namely that THE BLOCKS HAVE
BEEN WRITTEN TO THE WRONG FILES.

It would be nice to know, for each of ext2fs, ext3fs, and reiserfs,
what the -intended- outcome is, and what the -actual- outcome is
(since implementation bugs might make the actual outcome different
from the intended outcome). Any additional filesystems anyone would
like to toss into the pot would be welcome; maybe I'll post a matrix
of the results, if we get some.

I'm -assuming- that the intended outcome for reiserfs (without data
journalling) is one of (a), (b), or (c). If the intended outcome for
reiserfs without data journalling [or -any- FS, really] is in fact
(d), then I don't understand how this filesystem can be intended for
any reliable service, since a failure will garble all files written in
the last several seconds in a fashion that is very, very difficult to
unscramble. (-Perhaps-, if all the metadata is indeed correct, it
would be possible to at least -identify- which files may have gotten
smashed, by looking for all files whose mtime or ctime is in the last
60 seconds (more?) before the failure, but they'd still be trashed in
bizarre ways---it's much easier to fix a file (particularly a text
file) that is simply out of date (having had only some, or none, of
its recent data written) than it is to fix one that's had data from
other file(s) added to it in random places. Furthermore, files such
as wtmp will probably get their mtime modified the instant the system
comes back up, further muddying the waters.)

Can someone(s) help to address the above? And, even better, could
this information be placed prominently on the web pages describing the
relevant file systems? It would be extremely useful for people trying
to decide which one to run to know this -before- they have committed
umpteen gigabytes to one or the other and -then- get bitten.

Thanks!

P.S. Nikita Danilov said that there is a data-journalling patch to
reiserfs written by Chris Mason <Mason@Suse.COM>, but has not
responded with a URL to it; can someone (or Chris? now CC'ed) do so?
A search for reiserfs and mason is useless, yielding 12,000 hits.
(I'm particularly interested in one for reiserfs 3.6.25 and Mandrake
8.0, but I assume there may be several variants in the same
repository.) Benchmarking data on the performance impact of data
journalling for reiserfs, ext3fs, and anything else anyone cares to
supply would probably be useful to lots of people as well.

P.P.S. I say reset and not power-off, although I hope that this is
moot, because I presume that the unsynced data, by virtue of being
unsynced, is nowhere near the disk datapaths anyway. But either way,
a reset should let the disks continue to write data out of their write
buffers, assuming that a CPU reset doesn't flush such pending
transactions; I don't know if there's some IDE bus sequence that can
do this, and whether CPU reset would issue such a sequence. It may
not matter; is it common that disks might leave data buffered but
unwritten for 30 seconds if there is no other disk activity? I would
hope that this is -not- true and that the buffered data is buffered
only while there is other activity, since failing to flush the buffer
when the disk is idle only increases the risk of losing it without
improving performance at all.

^ permalink raw reply [flat|nested] 24+ messages in thread
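
The suggestion above of identifying possibly-affected files by their mtime or ctime could be sketched as below. This is an illustration only: the function name and the choice of window are assumptions, and, as the message itself points out, files such as wtmp are touched again at boot, so their timestamps will no longer fall in the window.

```python
import os
import stat


def files_modified_between(root, start, end):
    """Return sorted paths of regular files under `root` whose mtime or
    ctime falls in the closed interval [start, end] (seconds since epoch)."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)  # do not follow symlinks
            except OSError:
                continue  # vanished or inaccessible; skip it
            if not stat.S_ISREG(info.st_mode):
                continue
            if start <= info.st_mtime <= end or start <= info.st_ctime <= end:
                hits.append(path)
    return sorted(hits)
```

With a known reset time `t`, something like `files_modified_between("/", t - 60, t)` would list the candidates; how wide the window should really be depends on how long dirty data can stay unflushed, which the thread leaves open.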
* Re: [reiserfs-list] ReiserFS data corruption in very simple configuration
2001-09-29 4:44 ` Lenny Foner
@ 2001-09-29 12:52 ` Lehmann
2001-10-01 1:00 ` foner-reiserfs
2001-10-01 11:30 ` Stephen C. Tweedie
1 sibling, 1 reply; 24+ messages in thread
From: Lehmann @ 2001-09-29 12:52 UTC (permalink / raw)
To: Lenny Foner; +Cc: sct, Nikita, Mason, linux-kernel, reiserfs-list

On Sat, Sep 29, 2001 at 12:44:59AM -0400, Lenny Foner <foner-reiserfs@media.mit.edu> wrote:
> isn't fixed by fsck? [See outcome (d) below.] I'm having difficulty
> believing how this can be possible for a non-journalling filesystem.

If you have difficulties in believing this, may I ask you how you think it
is possible for a non-journaling filesystem to prevent this at all?

> But what about written to the wrong files? See below.

What you see is most probably *old* data, not data from another (still
existing) file.

> has not happened yet. (I don't know how often reiserfs will be synced
> by default; 60 seconds? Longer? Presumably running "sync" will force

mostly like with any other filesystem (man bdflush)

> Now, we have the following possibilities for the outcome after the
> (a) Metadata correctly written, file data correctly written.

all filesystems ;)

> (b) Metadata correctly written, file data partially written.
> (E.g., one or both files might have been partially or completely
> updated.)

ext2, reiserfs.

> (c) Metadata correctly written, file data completely unwritten.
> (Neither file got updated at all.)

ext2, reiserfs.

> (d) Metadata correctly written, FILE DATA INTERCHANGED BETWEEN A AND B.

this shouldn't happen on reiserfs. however, the unwritten parts of file a
can easily contain data formerly in file b.

> (e) Metadata corrupted in some fashion, file data undefined.
> ("Undefined" means could be any of (a) through (d) above; I don't care.)

this should be prevented by journaling (of course, this won't help against
harddisk failures) on reiserfs. ext2 often has this problem, but fsck
usually can repair it. it's easy to tell metadata from filedata on ext2.

> whether we can "guarantee that all the data blocks have been written",
> but may be missing the point I was making, namely that THE BLOCKS HAVE
> BEEN WRITTEN TO THE WRONG FILES.

remember that the blocks have previous content, and reiserfs' tails
optimization means that files appended all the time (wtmp) can move around
rapidly (at least their tail).

> P.P.S. I say reset and not power-off, although I hope that this is
> moot, because I presume that the unsynced data, by virtue of being
> unsynced, is nowhere near the disk datapaths anyway.

this can make a big difference. many disks (ibm, maxtor) nowadays write
partial blocks on power outage, this gives "Uncorrectable read errors",
which is fatal, because no filesystem so far can work around this. It's
easy to repair (just rewrite the block), but would require filesystem
feedback (hey, reiserfs, this metadata block is *garbage*).

> a reset should let the disks continue to write data out of their write
> buffers, assuming that a CPU reset doesn't flush such pending

they should, yes. OTOH, ide disks are cheap...

> not matter; is it common that disks might leave data buffered but
> unwritten for 30 seconds if there is no other disk activity? I would

no. and it doesn't make sense. but it's not forbidden or sth.

--
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

^ permalink raw reply [flat|nested] 24+ messages in thread
* [reiserfs-list] ReiserFS data corruption in very simple configuration
2001-09-29 12:52 ` [reiserfs-list] " Lehmann
@ 2001-10-01 1:00 ` foner-reiserfs
2001-10-01 1:26 ` Lehmann
0 siblings, 1 reply; 24+ messages in thread
From: foner-reiserfs @ 2001-10-01 1:00 UTC (permalink / raw)
To: pcg; +Cc: sct, Nikita, Mason, linux-kernel, reiserfs-list, foner-reiserfs

    Date: Sat, 29 Sep 2001 14:52:29 +0200
    From: <pcg@goof.com ( Marc) (A.) (Lehmann )>

Thanks for your response! Bear with me, though, because I'm asking
a design question below that relates to this.

    On Sat, Sep 29, 2001 at 12:44:59AM -0400, Lenny Foner <foner-reiserfs@media.mit.edu> wrote:
    > isn't fixed by fsck? [See outcome (d) below.] I'm having difficulty
    > believing how this can be possible for a non-journalling filesystem.

    If you have difficulties in believing this, may I ask you how you think it
    is possible for a non-journaling filesystem to prevent this at all?

Naively, one would assume that any non-journalling FS that has written
correct metadata through to the disk would either have written updates
into files, or failed to write them, but would not have written new
(<60 second old) data into different files than the data was destined
for. (I suppose the assumption I'm making here is that, when creating or
extending a file, the metadata is written -last-, e.g., file blocks
are allocated, file data is written, and -then- metadata is written.
That way, a failure anywhere before finality simply seems to vanish,
whereas writing metadata first seems to cause the lossage below.)

    > But what about written to the wrong files? See below.

    What you see is most probably *old* data, not data from another (still
    existing) file.

I'm... dubious, but maybe. As mentioned earlier in this thread, one
of the failures I saw consisted of having several lines of my
XFree86.0.log file appended to wtmp---when I logged in after the
failure, I got "Last login: " followed by several lines from that file
instead of a date. (Other failures scrambled other files worse.)

Now, it's -possible- that rsfs allocated an extra portion to the end
of wtmp for the last-login data (as a user of the fs, I don't care
whether officially this was a "block", an entry in a journal, etc),
login "wrote" to that region (but it wasn't committed yet 'cause no
sync), my XFree86.0.log file was "created" and "written" (again
uncommitted), I pushed reset, and then when it came back up, the end
of wtmp had data from the -previous- copy of XFree86.0.log that had
been freed (because it was unlinked when the next copy was written)
but which had not actually had the wtmp data written to it yet
(because a sync hadn't happened). I have no way to verify this, since
one XFree86.0.log looks much like the other.

Conceptually, this would imply that wtmp was extended into disk
freespace, which just happened to have that logfile in it (instead of
zero bytes). Is this what you're talking about when you say "*old*
data"? I think so, and that seems to match your comment below about
file tails moving around rapidly.

But it doesn't explain -why- it works this way in the first place.
Wouldn't it make more sense to commit metadata to disk -after- the
data blocks are written? After all, if -either one- isn't written,
the file is incomplete. But if the metadata is written -last-, the
file simply looks like the data was never added. If the metadata is
written -first-, the file can scoop up random trash from elsewhere in
the filesystem.

I contend that this is -much- worse, because it can render a
previously-good file completely unparseable by tools that expect that
-all- of the file is in a particular syntax. It's just an accident, I
guess, that login will accept any random trash when it prints its
"last-login" message, rather than falling over with a coredump because
it doesn't look like a date. [And see * below.]

Unfortunately, this behavior meant that X -did- fall over, because my
XF86Config file was trashed by being scrambled---I'd recently written
out a new version, after all---and the trashed copy no longer made any
sense. I would have been -much- happier to have had the -unmodified-,
-old- version than a scrambled "new" version! Without Emacs ~ files,
this would have been much worse.

Consider an app that, "for reliability", rewrites a file by creating a
temp copy, writing it out, then renaming the temp over the original
[this is how Emacs typically saves files]. But if you write the
metadata first, you foil this attempt to be safe, because you might
have this sequence at the actual disk: [magnetic oxide updated
w/rename][start updating magnetic oxide with tempfile data][power
failure or reset]---ooops! original file gone, new file doesn't have
its data yet, so sorry, thanks for playing.

By writing metadata first, it seems that reiserfs violates the
idempotence of many filesystem operations, and does exactly the
opposite of what "journalling" implies to anyone who understands
databases, namely that either the operation completes entirely, or it
is completely undone. Yes, yes, I know (now!) that it claims to only
journal the metadata, but how does this help when what it's
essentially doing is trashing the -data- in unexpected ways exactly
when such journalling is supposed to help, namely across a machine
failure?

This seems like such an elementary design defect that I'm at a loss to
understand why it's there. There -must- be some excellent reason,
right? But what? And if not, can it be fixed?

I'm also still waiting to find out how to make reiserfs actually
journal its data, and what the performance implications of this are.
No one has responded with a URL.

[*] It's also a security hole. If I want to read a file that I'm not
authorized to read, -but- I can cause a kernel panic (or a blackout!),
then I can craftily wait until up to several seconds after the
"secure" file is being rewritten (presumably via the write-tempfile-
and-relink method), create a big file of my own, and force the
panic---my file may then get some of the secure blocks from the old
copy. And, unlike filesystems that write metadata last, the "secure"
program can't just zero out the blocks of the file it's about to
unlink, because---since metadata is written first---those zeroes won't
have made it to disk yet even though the blocks have been declared
free and included in my file. I now know what's in your file.
Whoops. And this is such an enormous timing hole that I can write a
program that just checks every 5 seconds or so for a new copy of the
secure file, -then- forces the failure---I need not get the timing
very good, as long as it's likely that I'll do so before the next
sync. It's so bad that, even if I can't force a panic, my program can
just beep and I'll immediately go short out the outlet that happens to
be on the same circuit as the machine I'm attacking.

    [ . . . ]

    > (d) Metadata correctly written, FILE DATA INTERCHANGED BETWEEN A AND B.

    this shouldn't happen on reiserfs. however, the unwritten parts of file a
    can easily contain data formerly in file b.

Then why allow metadata to be written first instead of last?

    > (e) Metadata corrupted in some fashion, file data undefined.
    > ("Undefined" means could be any of (a) through (d) above; I don't care.)

    this should be prevented by journaling (of course, this won't help against
    harddisk failures) on reiserfs. ext2 often has this problem, but fsck
    usually can repair it. it's easy to tell metadata from filedata on ext2.

    > whether we can "guarantee that all the data blocks have been written",
    > but may be missing the point I was making, namely that THE BLOCKS HAVE
    > BEEN WRITTEN TO THE WRONG FILES.

    remember that the blocks have previous content, and reiserfs' tails
    optimization means that files appended all the time (wtmp) can move around
    rapidly (at least their tail).

    [ . . . ]

^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [reiserfs-list] ReiserFS data corruption in very simple configuration
2001-10-01 1:00 ` foner-reiserfs
@ 2001-10-01 1:26 ` Lehmann
2001-10-01 2:32 ` foner-reiserfs
2001-10-03 16:28 ` Toby Dickenson
0 siblings, 2 replies; 24+ messages in thread
From: Lehmann @ 2001-10-01 1:26 UTC (permalink / raw)
To: foner-reiserfs; +Cc: sct, Nikita, Mason, linux-kernel, reiserfs-list

On Sun, Sep 30, 2001 at 09:00:49PM -0400, foner-reiserfs@media.mit.edu wrote:
> extending a file, the metadata is written -last-, e.g., file blocks
> are allocated, file data is written, and -then- metadata is written.

this is almost impossible to achieve with existing hardware (witness the
many discussions about disk caching for example), and, without journaling,
might even be slow.

> of wtmp had data from the -previous- copy of XFree86.0.log that had
> been freed (because it was unlinked when the next copy was written)
> but which had not actually had the wtmp data written to it yet

It's easily possible, but it could also be a bug. Let the reiserfs authors
decide. However, if it is indeed "a bug" then fixing it would only lower
the frequency of occurrence.

Only ext3 (some modes) + turning off your harddisk's cache can ensure
this, at the moment.

> to have that logfile in it (instead of zero bytes). Is this what
> you're talking about when you say "*old* data"? I think so, and that
> seems to match your comment below about file tails moving around
> rapidly.

appending to logfiles will result in a lot of movement. with other,
strictly block-based filesystems this occurs relatively frequent, and data
will not usually move around. with reiserfs tail movement is frequent.

> Wouldn't it make more sense to commit metadata to disk -after- the
> data blocks are written?

The problem is that there is currently no easy way to achieve that.

> file simply looks like the data was never added. If the metadata is
> written -first-, the file can scoop up random trash from elsewhere in

Also, this is not a matter of metadata first or last. Sometimes you need
metadata first, sometimes you need it last. And in many cases, "metadata"
does not need to change, while data still changes.

> the filesystem. I contend that this is -much- worse, because it can
> render a previously-good file completely unparseable by tools that
> expect that -all- of the file is in a particular syntax.

It depends - with ext2 you frequently have garbled files, too. Basically,
if you write to a file and turn off the power the outcome is unexpected,
and will always be (unless you are ready to take the big speed hit).

> Unfortunately, this behavior meant that X -did- fall over, because my
> XF86Config file was trashed by being scrambled---I'd recently written
> out a new version, after all---and the trashed copy no longer made any

But the same thing can and does happen with ext2, depending on your editor
and your timing. It is not a reiserfs thing.

> But if you write the metadata first, you foil this attempt to be safe,
> because you might have this sequence at the actual disk: [magnetic
> oxide updated w/rename][start updating magnetic oxide with tempfile
> data][power failure or reset]---ooops! original file gone, new file
> doesn't have its data yet, so sorry, thanks for playing.

Of course. If you want data to hit the disk, you have to use fsync. This
does work with reiserfs and will ensure that the data hits the disk. If
you don't do this then bad things might happen.

> By writing metadata first, it seems that reiserfs violates the
> idempotence of many filesystem operations, and does exactly the
> opposite of what "journalling" implies to anyone who understands
> databases, namely that either the operation completes entirely, or it
> is completely undone.

You are confusing databases with filesystems, however. Most journaling
filesystems work that way. Some (like ext3) are nice enough to let you
choose.

> journal the metadata, but how does this help when what it's essentially
> doing is trashing the -data- in unexpected ways exactly when such
> journalling is supposed to help, namely across a machine failure?

But ext2 works in the same way. It does happen more often with reiserfs
(especially with tails), but ignoring the problem for ext2 doesn't make it
right. If applications don't work reliably with reiserfs, they don't work
reliably with ext2. If you want reliability then mount synchronous.

> This seems like such an elementary design defect that I'm at a loss
> to understand why it's there.

Just about every filesystem has this "elementary design defect". If you
want data to hit the disk, sync it. It's that simple.

> There -must- be some excellent reason,
> right? But what? And if not, can it be fixed?

Speed is an excellent reason. The fix is to tell the kernel to write the
data out to the platters.

Anyway, this is a good time to review the various discussions on the
reiserfs list and the kernel list on how to teach the kernel (if it is
possible) to implement loose write-ordering.

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

^ permalink raw reply [flat|nested] 24+ messages in thread
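[Illustration, not part of the original message.] The write-tempfile-then-rename pattern discussed above only helps across a crash if the new data is forced to disk before the rename. A minimal sketch of that ordering, assuming a POSIX system; the function name replace_file_durably is hypothetical, and how much the final directory fsync guarantees depends on the filesystem:

```python
import os
import tempfile


def replace_file_durably(path, data):
    """Write data to a temp file, fsync it, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename never
    # crosses a filesystem boundary. Note: mkstemp creates the file with
    # mode 0600, which may differ from the original file's permissions.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()              # push Python's buffer to the kernel
            os.fsync(tmp.fileno())   # ask the kernel to push it to the disk
        os.replace(tmp_path, path)   # rename over the original
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # fsync the containing directory so the rename itself is requested
    # to be written out as well (assumption: directories can be opened
    # read-only and fsynced, as on Linux).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

With this ordering, the rename is only issued after fsync has returned for the new contents, which is the step the plain "write temp, rename" sequence leaves out.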
* [reiserfs-list] ReiserFS data corruption in very simple configuration 2001-10-01 1:26 ` Lehmann @ 2001-10-01 2:32 ` foner-reiserfs 2001-10-03 16:28 ` Toby Dickenson 1 sibling, 0 replies; 24+ messages in thread From: foner-reiserfs @ 2001-10-01 2:32 UTC (permalink / raw) To: pcg; +Cc: sct, Nikita, Mason, linux-kernel, reiserfs-list, foner-reiserfs Date: Mon, 1 Oct 2001 03:26:27 +0200 From: <pcg@goof.com ( Marc) (A.) (Lehmann )> On Sun, Sep 30, 2001 at 09:00:49PM -0400, foner-reiserfs@media.mit.edu wrote: > extending a file, the metadata is written -last-, e.g., file blocks > are allocated, file data is written, and -then- metadata is written. this is almost impossible to achieve with existing hardware (witness the many discussions about disk caching for example), and, without journaling, might even be slow. I think perhaps we may be talking past each other; let me try to clarify. As I said earlier in this thread, this has nothing at all to do with disk caching. Let me restate this again: The scenario I'm discussing is an otherwise-idle machine that had 2 (maybe 3) files modified, sat idle for 30-60 seconds, and then had the reset button pushed. I would expect that either file data and metadata got written, or neither got written, but not metadata without file data. This is repeatable more or less at will---I didn't -just- happen to catch it -just- as it decided to frob the disks. Instead, the problem seems to be that reiserfs is perfectly happy to update the on-disk representation of which disk blocks contain which files' data, and then -sit there- for a long time (a minute? longer?) without -also- attempting to flush the file data to the disk. This then leads to corrupted files after the reset. It's not that the CPU sent data to the disk subsystem that failed to be written by the time of the interruption; it's that the data was still sitting in RAM and the CPU hadn't even decided to get it out the IDE channel yet. 
This means that there is -always- a giant timing hole which can corrupt data, as opposed to just the much-tinier hole that would be created if the file-bytes-to-disk-bytes correspondence were updated immediately after the write that wrote the data---it would be hard for me to accidentally hit such a hole. > of wtmp had data from the -previous- copy of XFree86.0.log that had > been freed (because it was unlinked when the next copy was written) > but which had not actually had the wtmp data written to it yet It's easily possible, but it could also be a bug. Let's the reiserfs authors decide. However, if it is indeed "a bug" then fixing it would only lower the frequency of occurance. True, but as long as it makes it only happen if the disk is -in progress of writing stuff- when the reset or power failure happens, the risk is -greatly- reduced. Right now, it's an enormous timing hole, and one that's likely to be hit---it's happened to me -every single time- I've had to hit the reset button because (for example) I wedged X while debugging, and even if I waited a minute after the wedge-up to do so! The way I've avoided it is by running a job that syncs once a second while doing debugging that might possibly make me unable to take the machine down cleanly. This is a disgusting and unreliable kluge. Only ext3 (some modes) + turning off your harddisk's cache can ensure this, at the moment. Or ext3 (some modes) + assuming that the disk will at least write data that's been sent to it, even if the CPU gets reset. (I know it's hopeless if power fails, but that can be made arbitrarily unlikely, compared to a kernel panic or having to do a CPU reset.) > to have that logfile in it (instead of zero bytes). Is this what > you're talking about when you say "*old* data"? I think so, and that > seems to match your comment below about file tails moving around > rapidly. appending to logfiles will result in a lot of movement. 
with other, strictly block-based filesystems this occurs relatively frequent, and data will not usually move around. with reiserfs tail movement is frequent. Right. > Wouldn't it make more sense to commit metadata to disk -after- the > data blocks are written? The problem is that there is currently no easy way to achieve that. Why not? (Ignore the disk-caching issue and concentrate on when the kernel asks for data to be written to the disk. I am -assuming that the kernel either (a) writes the data in the order requested, or at least (b) once it decides to write anything, keeps sending it to the disk until its queue is completely empty.) > file simply looks like the data was never added. If the metadata is > written -first-, the file can scoop up random trash from elsewhere in Also, this is not a matter of metadata first or last. Sometimes you need metadata first, sometimes you need it last. And in many cases, "metadata" does not need to change, while data still changes. I'm using "metadata" here as a shorthand for "how the filesystem knows which byte on disk corresponds to which byte in the file", not just things like atime, ctime, etc. > the filesystem. I contend that this is -much- worse, because it can > render a previously-good file completely unparseable by tools that > expect that -all- of the file is in a particular syntax. It depends - with ext2 you frequently have garbled files, too. Basically, if you write to a file and turn off the power the outcome is unexpected, and will always be (unless you are ready to take the big speed hit). > Unfortunately, this behavior meant that X -did- fall over, because my > XF86Config file was trashed by being scrambled---I'd recently written > out a new version, after all---and the trashed copy no longer made any But the same thing can and does happen with ext2, depending on your editor and your timing. It is not a reiserfs thing. 
Well, I've gotten several pieces of private mail from people complaining that it's happening a lot more with reiserfs. And I've never been bitten this way in years of ext2 usage. > But if you write the metadata first, you foil this attempt to be safe, > because you might have this sequence at the actual disk: [magnetic > oxide updated w/rename][start updating magnetic oxide with tempfile > data][power failure or reset]---ooops! original file gone, new file > doesn't have its data yet, so sorry, thanks for playing. Of course. If you want data to hit the disk, you have to use fsync. This does work with reiserfs and will ensure that the data hits the disk. If you don't do this then bad things might happen. It's that I either want the data to hit the disk, or -not- to hit the disk, but not to partially-update files such that things are inconsistent even when the disk has been idle for 20 seconds and the system isn't doing anything else. It's even worse in that the filesystem -believes- itself to be accurate, even though the data it's actually storing is scrambled. > By writing metadata first, it seems that reiserfs violates the > idempotence of many filesystem operations, and does exactly the > opposite of what "journalling" implies to anyone who understands > databases, namely that either the operation completes entirely, or it > is completely undone. You are confusing databases with filesystems, however. Most journaling filesystems work that way. Some (like ext3) are nice enough to let you choose. I am deliberately talking about databases, because the terminology and technology of journalling came from there. Using the term "journalling" and then behaving very differently from the way it's used in database design is misleading at best. Several people who've written to me have said they felt "cheated" to discover that reiserfs didn't actually journal the data or otherwise misbehaved in ways similar to my problem here. 
^ permalink raw reply [flat|nested] 24+ messages in thread
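[Illustration, not part of the original message.] The "sync once a second" workaround described in the message above can be sketched as a background thread. This is only a sketch of the kluge, not a fix; start_periodic_sync is a hypothetical name, and os.sync is assumed to be available (Unix only):

```python
import os
import threading


def start_periodic_sync(interval_seconds=1.0, sync_fn=os.sync):
    """Call sync_fn every interval_seconds until the returned event is set."""
    stop = threading.Event()

    def loop():
        # Event.wait returns False on timeout and True once stop is set,
        # so the loop syncs once per interval and exits promptly on stop.
        while not stop.wait(interval_seconds):
            sync_fn()

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return stop, thread
```

Usage: call start_periodic_sync() before a risky debugging session, and stop.set() afterwards. It narrows the window in which unwritten data sits in RAM to roughly one interval; it does not close it.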
* Re: [reiserfs-list] ReiserFS data corruption in very simple configuration
2001-10-01 1:26 ` Lehmann
2001-10-01 2:32 ` foner-reiserfs
@ 2001-10-03 16:28 ` Toby Dickenson
1 sibling, 0 replies; 24+ messages in thread
From: Toby Dickenson @ 2001-10-03 16:28 UTC (permalink / raw)
To: pcg; +Cc: foner-reiserfs, sct, Nikita, Mason, linux-kernel, reiserfs-list

>Of course. If you want data to hit the disk, you have to use fsync. This
>does work with reiserfs and will ensure that the data hits the disk. If
>you don't do this then bad things might happen.

This is probably a naive question, but this thread has already proved
me wrong on one naive assumption.....

If the sequence is:

1. append some data to file A
2. fsync(A)
3. append some further data to A
4. some writes to other files
5. power loss

Is it guaranteed that all the data written in step 1 will still be
intact?

The potential problem I can see is that some data from step 1 may have
been written in a tail, the tail moves during step 3, and then the
original tail is overwritten before the new tail (including data from
before the fsync) is safely on disk.

Thanks for your help,

Toby Dickenson
tdickenson@geminidataloggers.com

^ permalink raw reply [flat|nested] 24+ messages in thread
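[Illustration, not part of the original message.] Steps 1 to 3 of the sequence above, written out as code so the question is concrete. The helper name append_bytes and its do_fsync parameter are hypothetical; the sketch shows what an application can request, and says nothing about whether reiserfs keeps the step 1 data intact once its tail moves:

```python
import os


def append_bytes(path, data, do_fsync):
    """Append data to path; if do_fsync is true, fsync before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than asked; loop until done.
            written = os.write(fd, view)
            view = view[written:]
        if do_fsync:
            os.fsync(fd)  # steps 1 and 2: the data is requested to be on disk
    finally:
        os.close(fd)
```

Step 1 plus step 2 is append_bytes(path, first, do_fsync=True); step 3 is append_bytes(path, more, do_fsync=False). The question is whether a crash after step 3 can still damage the bytes from the first call.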
* Re: ReiserFS data corruption in very simple configuration 2001-09-29 4:44 ` Lenny Foner 2001-09-29 12:52 ` [reiserfs-list] " Lehmann @ 2001-10-01 11:30 ` Stephen C. Tweedie 1 sibling, 0 replies; 24+ messages in thread From: Stephen C. Tweedie @ 2001-10-01 11:30 UTC (permalink / raw) To: Lenny Foner; +Cc: sct, linux-kernel, reiserfs-list Hi, On Sat, Sep 29, 2001 at 12:44:59AM -0400, Lenny Foner wrote: > Not true. ext2, ext3 in its "data=writeback" mode, and reiserfs can > all demonstrate this behaviour. Reiserfs is being no worse than ext2 > (the timings may make the race more or less likely in reiserfs, but > ext2 _is_ vulnerable.) > > ext2fs can write parts of file A to file B, and vice versa, and this > isn't fixed by fsck? No, we're not talking about incorrect writes, but *incomplete* writes, which is a totally different thing. An ext2 write of new data involves many steps: the inode needs to be written to mark the file's new size, the indirect mapping block[s] may have to be written to record where the data is, and the data blocks themselves need to be written. Not only that, but a delete also requires multiple writes. If you delete a file and rapidly create a new one, then the image of the filesystem in cache remains totally consistent, but the copy on disk is updated incrementally and if you crash before the entire image is updated, you can end up seeing both bits of the old file that was in the process of being deleted, and the new file that was being created. In addition, journaling prevents metadata inconsistencies from occuring due to incomplete writes, but on its own, metadata journaling doesn't mean that the data blocks are also in sync --- the disk blocks describing a new file might be on disk, but the data blocks that the file contains might not be. 
Reiserfs, and also ext3 in its fastest "writeback" mode, both behave like this (but ext3's other modes order data writes so that this situation never happens: data blocks are always flushed to disk before the metadata is committed.) > e2fsck only restores metadata consistency on ext2 after a crash: it > can't possibly guarantee that all the data blocks have been written. > > But what about written to the wrong files? See below. See above. If all the metadata is intact, how can e2fsck *possibly* detect whether a data block contains the old or the new contents of the block? > Let's take this scenario: Files A and B have had blocks written to > them sometime in the recent past (30 to 60 seconds or so) and a sync > has not happened yet. (I don't know how often reiserfs will be synced > by default; 60 seconds? Longer? Presumably running "sync" will force > it, but I don't know when else it will happen.) File A may have been > completely rewritten or newly written (e.g., what Emacs does when it > saves a file), whereas file B may have simply been appended to (e.g., > what happens when wtmp is updated). > > The CPU reset button is then pushed. [See P.P.S. at end of this message.] > > Now, we have the following possibilities for the outcome after the > system comes back up and has finished checking its filesystem: > > (a) Metadata correctly written, file data correctly written. > (b) Metadata correctly written, file data partially written. > (E.g., one or both files might have been partially or completely > updated.) > (c) Metadata correctly written, file data completely unwritten. > (Neither file got updated at all.) > (d) Metadata correctly written, FILE DATA INTERCHANGED BETWEEN A AND B. > (E.g., File A gets some of file B written somewhere within it, > and file B gets some of file A written somewhere within it---this > is the behavior I observed, at least twice, with reiserfs.) > (e) Metadata corrupted in some fashion, file data undefined. 
> ("Undefined" means could be any of (a) through (d) above; I don't care.) > > Now, which filesystems can show each outcome? I don't know. I > contend that reiserfs does (d). Stephen Tweedie talks above about > whether we can "guarantee that all the data blocks have been written", > but may be missing the point I was making, namely that THE BLOCKS HAVE > BEEN WRITTEN TO THE WRONG FILES. For ext3, (d) will never happen in this case. You can only get "wrong" data blocks if one of the files is being *deleted*, and its blocks have been allocated to a new file, and the handover of those blocks is incomplete at the time of the crash. ext3 will only give you (a) (both metadata and data correctly written) or (f) (neither have yet been written at all) if it is running in ordered or data-journaling mode. (b) and (c) are possible only if you are in writeback mode. (d) and (e) never happen if you're creating two files, although in writeback mode (d) is possible if, say, you are deleting A and writing B at the same time (the other ext3 modes prevent this scenario too.) Cheers, Stephen ^ permalink raw reply [flat|nested] 24+ messages in thread
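[Illustration, not part of the original message.] The outcomes listed in the message above can be tabulated per ext3 data mode. This is one reading of Stephen's description, not ext3 code: the function name is hypothetical, and treating (a) and (f) as also possible in writeback mode is an assumption, since the message only states them explicitly for the ordered and journal modes.

```python
def possible_outcomes(data_mode, deleting_while_writing=False):
    """Return the crash outcomes, labelled (a)-(f) as in the thread.

    a: metadata and data both written      b: data partially written
    c: data completely unwritten           d: data from the wrong file
    e: metadata corrupted                  f: neither written yet
    """
    if data_mode in ("ordered", "journal"):
        # Data blocks are flushed before the metadata is committed.
        return {"a", "f"}
    if data_mode == "writeback":
        outcomes = {"a", "b", "c", "f"}  # a and f here are an assumption
        if deleting_while_writing:
            # Blocks of a file being deleted may be handed to a new file
            # before the handover is complete on disk.
            outcomes.add("d")
        return outcomes
    raise ValueError("unknown ext3 data mode: %r" % (data_mode,))
```

Outcome (e) never appears in the table, which matches the claim that metadata journaling prevents it in every mode.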
* Re: [reiserfs-list] Re: ReiserFS data corruption in very simple configuration
2001-09-22 12:47 ` Nikita Danilov
2001-09-22 20:44 ` foner-reiserfs
@ 2001-09-24 9:25 ` Jens Benecke
2001-10-14 14:52 ` Chris Mason
2001-09-25 20:13 ` Mike Fedyk
2 siblings, 1 reply; 24+ messages in thread
From: Jens Benecke @ 2001-09-24 9:25 UTC (permalink / raw)
To: linux-kernel, Reiserfs mail-list

[-- Attachment #1: Type: text/plain, Size: 1474 bytes --]

On Sat, Sep 22, 2001 at 04:47:31PM +0400, Nikita Danilov wrote:
> foner-reiserfs@media.mit.edu writes:
> > [Please CC me on any replies; I'm not on linux-kernel.]
> >
> > The ReiserFS that comes with both Mandrake 7.2 and 8.0 has
> > demonstrated a serious data corruption problem, and I'd like to know
> > (a) if anyone else has seen this, (b) how to avoid it, and (c) how to
> > determine how badly I've been bitten.
> >
> Stock reiserfs only provides meta-data journalling. It guarantees that
> structure of you file-system will be correct after journal replay, not
> content of a files. It will never "trash" file that wasn't accessed at
> the moment of crash, though. Full data-journaling comes at cost. There is
> patch by Chris Mason <Mason@Suse.COM> to support data journaling in
> reiserfs. Ext3 supports it also.

one question:

When I was using ext2 I always mounted the /usr partition read-only, so
that a fsck weren't necessary at boot - and the files were all guaranteed
to be OK to bring the system up at least.

Does this (mount -o ro) make sense with ReiserFS as well? What I mean is,
is there a chance of a file getting corrupted that was only *read* (not
*written*) at or before a power outage?

I mount all my system partitions with -o notail,noatime if that makes any
difference.

-- 
Jens Benecke ········ http://www.hitchhikers.de/ - Europas Mitfahrzentrale

rm -rf /bin/laden

[-- Attachment #2: Type: application/pgp-signature, Size: 240 bytes --]

^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [reiserfs-list] Re: ReiserFS data corruption in very simple configuration
2001-09-24 9:25 ` [reiserfs-list] " Jens Benecke
@ 2001-10-14 14:52 ` Chris Mason
2001-10-14 18:19 ` Jens Benecke
0 siblings, 1 reply; 24+ messages in thread
From: Chris Mason @ 2001-10-14 14:52 UTC (permalink / raw)
To: Jens Benecke, linux-kernel, Reiserfs mail-list

On Monday, September 24, 2001 11:25:10 AM +0200 Jens Benecke
<jens@jensbenecke.de> wrote:

> one question:
>
> When I was using ext2 I always mounted the /usr partition read-only, so
> that a fsck weren't necessary at boot - and the files were all guaranteed
> to be OK to bring the system up at least.
>
> Does this (mount -o ro) make sense with ReiserFS as well? What I mean is,
> is there a chance of a file getting corrupted that was only *read* (not
> *written*) at or before a power outage?

Yes, after the mount is finished, reiserfs won't change the files on a
readonly mount.

-chris

^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [reiserfs-list] Re: ReiserFS data corruption in very simple configuration
2001-10-14 14:52 ` Chris Mason
@ 2001-10-14 18:19 ` Jens Benecke
2001-10-14 20:04 ` Hans Reiser
2001-10-14 23:32 ` Bernd Eckenfels
0 siblings, 2 replies; 24+ messages in thread
From: Jens Benecke @ 2001-10-14 18:19 UTC (permalink / raw)
To: linux-kernel, Reiserfs mail-list

[-- Attachment #1: Type: text/plain, Size: 976 bytes --]

On Sun, Oct 14, 2001 at 10:52:54AM -0400, Chris Mason wrote:

> > When I was using ext2 I always mounted the /usr partition read-only, so
> > that a fsck weren't necessary at boot - and the files were all
> > guaranteed to be OK to bring the system up at least.
> >
> > Does this (mount -o ro) make sense with ReiserFS as well? What I mean
> > is, is there a chance of a file getting corrupted that was only *read*
> > (not *written*) at or before a power outage?
>
> Yes, after the mount is finished, reiserfs won't change the files on a
> readonly mount.

What I meant is this: AFAIK, if you exclude broken hardware, in ext2 there
is no chance of a file that was never written to since mounting being
corrupted on a crash, even if the fs was mounted read-write.

Is this the same thing with ReiserFS?

-- 
Jens Benecke ········ http://www.hitchhikers.de/ - Europas Mitfahrzentrale

Crypto regulations will only hinder criminals who obey the law.

[-- Attachment #2: Type: application/pgp-signature, Size: 240 bytes --]

^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [reiserfs-list] Re: ReiserFS data corruption in very simple configuration
2001-10-14 18:19 ` Jens Benecke
@ 2001-10-14 20:04 ` Hans Reiser
2001-10-14 23:32 ` Bernd Eckenfels
1 sibling, 0 replies; 24+ messages in thread
From: Hans Reiser @ 2001-10-14 20:04 UTC (permalink / raw)
To: Jens Benecke; +Cc: linux-kernel, Reiserfs mail-list

Jens Benecke wrote:

> What I meant is this: AFAIK, if you exclude broken hardware, in ext2 there
> is no chance of a file that was never written to since mounting being
> corrupted on a crash, even if the fs was mounted read-write.
>
> Is this the same thing with ReiserFS?

Yes.

^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [reiserfs-list] Re: ReiserFS data corruption in very simple configuration
2001-10-14 18:19 ` Jens Benecke
2001-10-14 20:04 ` Hans Reiser
@ 2001-10-14 23:32 ` Bernd Eckenfels
1 sibling, 0 replies; 24+ messages in thread
From: Bernd Eckenfels @ 2001-10-14 23:32 UTC (permalink / raw)
To: linux-kernel

In article <20011014201907.H20001@jensbenecke.de> you wrote:
> What I meant is this: AFAIK, if you exclude broken hardware, in ext2 there
> is no chance of a file that was never written to since mounting being
> corrupted on a crash

Well, you can either lose the file due to a broken directory (maybe you
find the missing inode in lost+found), or the file can even be corrupted
by an ext2 software error, which is unlikely, but all filesystems in
development are reported to eat files every now and then.

Greetings
Bernd

^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: ReiserFS data corruption in very simple configuration 2001-09-22 12:47 ` Nikita Danilov 2001-09-22 20:44 ` foner-reiserfs 2001-09-24 9:25 ` [reiserfs-list] " Jens Benecke @ 2001-09-25 20:13 ` Mike Fedyk 2001-09-26 14:43 ` Stephen C. Tweedie 2 siblings, 1 reply; 24+ messages in thread From: Mike Fedyk @ 2001-09-25 20:13 UTC (permalink / raw) To: linux-kernel On Sat, Sep 22, 2001 at 04:47:31PM +0400, Nikita Danilov wrote: > foner-reiserfs@media.mit.edu writes: > > [Please CC me on any replies; I'm not on linux-kernel.] > > I thought the whole point of a journalling file system was to > > -prevent- corruption due to an unexpected failure! This seems to be > > -far- worse than a normal filesystem---ext2fs would at least choke and > > force fsck to be run, which might actually fix the problem, but this > > is ridiculous---it just silently trashes random files. > > Stock reiserfs only provides meta-data journalling. It guarantees that > structure of you file-system will be correct after journal replay, not > content of a files. It will never "trash" file that wasn't accessed at > the moment of crash, though. Full data-journaling comes at cost. There > is patch by Chris Mason <Mason@Suse.COM> to support data journaling in > reiserfs. Ext3 supports it also. > When files on a ReiserFS mount have data from other files, does that mean that it has recovered wrong meta-data, or is it because the meta-data was committed before the data? So, if I write a file, does ReiserFS write the structures first, and if the data isn't written, whatever else was deleted from the block before will now be in the file? If that's so, then one way to keep old deleted data from getting into partially written files after a crash would be to zero out the blocks on unlink. I can imagine that this would prevent undelete, and slow down deleting considerably. Another way, may be to keep a journal of which blocks have actually been committed. Maybe a bitmap in the journal, or some other structure... 
If you have data journaling, does that mean there is a possibility of
recovering a complete file -before- it was written? i.e:

echo a > test;
sync;
cat picture.tif > test
(writing in progress, only partially in journal)
power off

Will "a" be in test upon recovery?

^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: ReiserFS data corruption in very simple configuration 2001-09-25 20:13 ` Mike Fedyk @ 2001-09-26 14:43 ` Stephen C. Tweedie 2001-10-01 3:38 ` Mike Fedyk 0 siblings, 1 reply; 24+ messages in thread From: Stephen C. Tweedie @ 2001-09-26 14:43 UTC (permalink / raw) To: linux-kernel; +Cc: Mike Fedyk, Stephen Tweedie Hi, On Tue, Sep 25, 2001 at 01:13:04PM -0700, Mike Fedyk wrote: > > Stock reiserfs only provides meta-data journalling. It guarantees that > > structure of you file-system will be correct after journal replay, not > > content of a files. It will never "trash" file that wasn't accessed at > > the moment of crash, though. Full data-journaling comes at cost. There > > is patch by Chris Mason <Mason@Suse.COM> to support data journaling in > > reiserfs. Ext3 supports it also. > When files on a ReiserFS mount have data from other files, does that mean > that it has recovered wrong meta-data, or is it because the meta-data was > committed before the data? It can be either, but the former can only be the result of a problem (either hardware fault or a data-corrupting software bug of some description). In the normal case, only the latter scenario happens. ext3 has a mode to flush all data before metadata gets committed. That is its default mode, and it avoids this problem without having to actually journal the data. > So, if I write a file, does ReiserFS write the structures first, and if the > data isn't written, whatever else was deleted from the block before will now > be in the file? Yep. ext3 behaves in the same way in its fastest "writeback" data mode. > If that's so, then one way to keep old deleted data from getting into > partially written files after a crash would be to zero out the blocks on > unlink. I can imagine that this would prevent undelete, and slow down > deleting considerably. Indeed. > Another way, may be to keep a journal of which blocks have actually been > committed. Maybe a bitmap in the journal, or some other structure... 
ext3 does exactly that. It's necessary to keep things in sync if we have blocks of data being deleted and reallocated as metadata, or vice-versa. > If you have data journaling, does that mean there is a possability of > recovering a complete file -before- it was written? i.e: > echo a > test; > sync; > cat picture.tif > test > (writing in progress, only partially in journal) > power off > Will "a" be in test upon recovery? If you are using full data journaling (ext3's "journal" data mode) or the default "ordered" data mode, then no, you never see such behaviour. In the ordered mode, it achieves this precisely because it is keeping a record of which blocks have been committed (or, more accurately, which *deleted* blocks have had the delete committed). If you do a "cat > file", then before the new data is written, the file gets truncated and all its old data blocks deleted. ext3 will then refuse to reuse those blocks until the delete has been committed, so if we crash and end up rolling back the delete transaction, we'll never see new data blocks in the old file. Cheers, Stephen ^ permalink raw reply [flat|nested] 24+ messages in thread
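[Illustration, not part of the original message.] The rule described above, that freed blocks are not reused until the delete has been committed, can be shown with a toy allocator. This is a model for illustration only, not ext3's data structures; the class and method names are hypothetical.

```python
class ToyAllocator:
    """Block allocator that defers reuse of freed blocks until commit."""

    def __init__(self, nblocks):
        self.free_blocks = set(range(nblocks))
        self.pending_free = set()  # freed, but the delete is not committed

    def allocate(self):
        if not self.free_blocks:
            raise RuntimeError("no reusable blocks until pending deletes commit")
        block = min(self.free_blocks)
        self.free_blocks.remove(block)
        return block

    def free(self, block):
        # The block is NOT handed back yet: if we crash now and the delete
        # is rolled back, the old file must still find its old data here.
        self.pending_free.add(block)

    def commit(self):
        # Once the delete is committed, the blocks become reusable.
        self.free_blocks |= self.pending_free
        self.pending_free.clear()
```

Because a freed block stays out of circulation until commit(), new data can never land in a block that a rolled-back delete would hand back to the old file.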
* Re: ReiserFS data corruption in very simple configuration
2001-09-26 14:43 ` Stephen C. Tweedie
@ 2001-10-01 3:38 ` Mike Fedyk
2001-10-03 16:14 ` Stephen C. Tweedie
0 siblings, 1 reply; 24+ messages in thread
From: Mike Fedyk @ 2001-10-01 3:38 UTC (permalink / raw)
To: linux-kernel

Hi,

On Wed, Sep 26, 2001 at 03:43:11PM +0100, Stephen C. Tweedie wrote:
> On Tue, Sep 25, 2001 at 01:13:04PM -0700, Mike Fedyk wrote:
> > If you have data journaling, does that mean there is a possibility of
> > recovering a complete file -before- it was written? i.e:
> >
> > echo a > test;
> > sync;
> > cat picture.tif > test
> > (writing in progress, only partially in journal)
> > power off
> >
> > Will "a" be in test upon recovery?
>
> If you are using full data journaling (ext3's "journal" data mode) or
> the default "ordered" data mode, then no, you never see such
> behaviour.
>

At this point, it looks like I'm going to get a partial picture.tif in
test after recovery...

> In the ordered mode, it achieves this precisely because it is keeping
> a record of which blocks have been committed (or, more accurately,
> which *deleted* blocks have had the delete committed). If you do a
> "cat > file", then before the new data is written, the file gets
> truncated and all its old data blocks deleted. ext3 will then refuse
> to reuse those blocks until the delete has been committed, so if we
> crash and end up rolling back the delete transaction, we'll never see
> new data blocks in the old file.
>

Now, it looks like I'll end up with "a" in test...

From what you're describing, it looks like the contents of test after a
truncate won't be overwritten by another transaction until the deletion of
those blocks has made it to disk... So, while in ordered, or journal mode,
I'd end up with "a" in test, but with writeback mode there is no such
guarantee.

Am I missing something?

Are there any known cases where ext3 will not be able to recover previous
data when a write wasn't able to complete?

> Cheers, > Stephen Mike ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: ReiserFS data corruption in very simple configuration
2001-10-01 3:38 ` Mike Fedyk
@ 2001-10-03 16:14 ` Stephen C. Tweedie
0 siblings, 0 replies; 24+ messages in thread
From: Stephen C. Tweedie @ 2001-10-03 16:14 UTC (permalink / raw)
To: linux-kernel

Hi,

On Sun, Sep 30, 2001 at 08:38:31PM -0700, Mike Fedyk wrote:

> From what you're describing, it looks like the contents of test after a
> truncate won't be overwritten by another transaction until the deletion of
> those blocks has made it to disk...  So, while in ordered, or journal mode,
> I'd end up with "a" in test, but with writeback mode there is no such
> guarantee.
>
> Am I missing something?
>
> Are there any known cases where ext3 will not be able to recover previous
> data when a write wasn't able to complete?

It depends on what the application is doing.  Applications often open
an existing file with O_TRUNC, write to it, then close it.  If you
crash between the truncate and the write being committed, then you'll
get a perfectly legal, sane, consistent, empty file on recovery.

--Stephen
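The window Stephen mentions is easy to see from user space. The following
is a minimal sketch, assuming a POSIX-like system; the temporary path and
the variable names (size_in_window, size_after) are made up for the
example. It does not simulate a crash. It only shows that once a file has
been opened with O_TRUNC, the old contents are already gone and the file is
empty until the later write arrives.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "test")

# Initial contents, roughly `echo a > test; sync`.
with open(path, "w") as f:
    f.write("a\n")
    f.flush()
    os.fsync(f.fileno())

# Reopen with O_TRUNC, as the shell does for `cat picture.tif > test`.
fd = os.open(path, os.O_WRONLY | os.O_TRUNC)
size_in_window = os.fstat(fd).st_size    # the old data is already gone here

# The replacement data only arrives with the later write.
os.write(fd, b"new contents\n")
os.fsync(fd)
os.close(fd)
size_after = os.path.getsize(path)
```

A crash while the file is in the state measured by size_in_window is the
case where, per the message above, recovery yields a consistent but empty
file.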
* Re: ReiserFS data corruption in very simple configuration
2001-09-22 10:00 ReiserFS data corruption in very simple configuration foner-reiserfs
2001-09-22 12:47 ` Nikita Danilov
@ 2001-10-01 15:27 ` Hans Reiser
2001-10-03 16:17 ` Stephen C. Tweedie
1 sibling, 1 reply; 24+ messages in thread
From: Hans Reiser @ 2001-10-01 15:27 UTC (permalink / raw)
To: foner-reiserfs; +Cc: linux-kernel

This is the meaning of metadata journaling: that writes in progress at the time
of the crash may write garbage, but you won't need to fsck.  You can get this
behaviour with other filesystems like FFS also.  If you cannot accept those
terms of service, you might use ext3 with data journaling on, but then your
performance will be far worse.

It is a tradeoff, not a bug.

Regarding where to email these types of reiserfs questions, you might email
reiserfs-list@namesys.com with such questions, or try
www.namesys.com/support.html if you want paid support service on it.

Best,

Hans

foner-reiserfs@media.mit.edu wrote:
>
> [Please CC me on any replies; I'm not on linux-kernel.]
>
> The ReiserFS that comes with both Mandrake 7.2 and 8.0 has
> demonstrated a serious data corruption problem, and I'd like
> to know (a) if anyone else has seen this, (b) how to avoid it,
> and (c) how to determine how badly I've been bitten.
>
> My configuration in each case has been an AMD CPU running ReiserFS
> exactly as configured "out of the box" by running the Mandrake 7.2 or
> 8.0 installation CD and opting to run ReiserFS instead of the default.
> This is a uniprocessor machine with one IDE 80GB Maxtor disk---no RAID
> or anything fancy like that.  The hardware itself is rock solid and
> has never demonstrated any faults at all.  (MDK 8.0 appears to use
> RSFS 3.6.25; I'm not longer running MDK 7.2, so I can't check that.)
> The machine had barely been used before each corruption problem; I'm
> not running some strange root-priv stuff, and each time, the FS hadn't
> had more than a few minutes to a few hours of use since being created.
>
> In each case, I've gotten in trouble by editing my XF86Config-4 file,
> guessing wrong on a modeline, hanging X (blank gray screen & no
> response to anything), and being forced to hit the reset button
> because nothing else worked.  Under 7.2, I discovered that my
> XF86Config-4 file suddenly had a block of nulls in it.  That time, I
> thought I must have been hallucinating, but I ran a background job to
> sync the filesystem every second while continuing to debug the X
> problems, and didn't see the corruption again.
>
> Now, I was just bitten by the -same- behavior under MDK 8.0.  After
> accidentally hanging X, I waited a few seconds just in case a sync was
> pending, hit reset, and had all sorts of lossage:
> (1) Parts of the XF86Conf-4 file had lines garbled, e.g.,
>     sections of the file had apparently been rearranged.
> (2) /var/log/XFree86.0.log was truncated, and maybe garbled.
> (2) Logging in as root was fine, but then logging in as myself
>     I got "Last login: <4-5 lines of my XFree86.0.log file (!)>"
>     instead of a date!  Logging in again gave me the proper
>     last-login time, but clearly wtmp or something else had
>     gotten stepped on in some weird way.
> Obviously, the behavior I saw once under MDK 7.2 was no hallucination
> or accidental yank in Emacs.
>
> I thought the whole point of a journalling file system was to
> -prevent- corruption due to an unexpected failure!  This seems to be
> -far- worse than a normal filesystem---ext2fs would at least choke and
> force fsck to be run, which might actually fix the problem, but this
> is ridiculous---it just silently trashes random files.
>
> So I now have possibly-undetected filesystem damage.  My -guess- is
> that only files written within a few minutes of the reset are likely
> to be affected, but I really don't know, and don't know of a good way
> to find out.  Must I reinstall the OS -again-, starting from a blank
> partition, to be sure?  Maybe I should just give up on ReiserFS completely.
>
> [If there is a more-appropriate place for me to send this---such as
> a particular Mandrake list, or a particular ReiserFS list---please let
> me know, particularly if I can get a quick answer -without- going
> through the overhead of subscribing to the list, being flooded, and
> unsubscribing---that's what archives are for.  Some websearching
> for "ReiserFS corruption" yielded -thousands- of hits---not a good
> sign---and a very large proportion of them were on this list, so I
> figure this is as good a place to ask as any.  Thanks again.]
* Re: ReiserFS data corruption in very simple configuration
2001-10-01 15:27 ` Hans Reiser
@ 2001-10-03 16:17 ` Stephen C. Tweedie
2001-10-03 20:06 ` Pascal Schmidt
0 siblings, 1 reply; 24+ messages in thread
From: Stephen C. Tweedie @ 2001-10-03 16:17 UTC (permalink / raw)
To: Hans Reiser; +Cc: foner-reiserfs, linux-kernel, Stephen Tweedie

Hi,

On Mon, Oct 01, 2001 at 07:27:31PM +0400, Hans Reiser wrote:
> This is the meaning of metadata journaling: that writes in progress at the time
> of the crash may write garbage, but you won't need to fsck.  You can get this
> behaviour with other filesystems like FFS also.  If you cannot accept those
> terms of service, you might use ext3 with data journaling on, but then your
> performance will be far worse.

ext3 with ordered data writes has performance nearly up to the level
of the fast-and-loose writeback mode for most workloads, and still
avoids ever exposing stale disk blocks after a crash.

Sure, it's a tradeoff, but there are positions between the two
extremes (totally unordered data writes, and totally journaled data
writes) which offer a good compromise here.

Cheers,
 Stephen
* Re: ReiserFS data corruption in very simple configuration
2001-10-03 16:17 ` Stephen C. Tweedie
@ 2001-10-03 20:06 ` Pascal Schmidt
2001-10-04 11:02 ` Stephen C. Tweedie
0 siblings, 1 reply; 24+ messages in thread
From: Pascal Schmidt @ 2001-10-03 20:06 UTC (permalink / raw)
To: Stephen C. Tweedie; +Cc: linux-kernel

On Wed, 3 Oct 2001, Stephen C. Tweedie wrote:

> ext3 with ordered data writes has performance nearly up to the level
> of the fast-and-loose writeback mode for most workloads, and still
> avoids ever exposing stale disk blocks after a crash.

What if the machine crashes with parts of the data blocks written to
disk, before the commit block gets submitted to the drive?

The journal will tell us that the write transaction hasn't finished, but
that doesn't mean that no data blocks made it to disk, right?

We won't expose stale disk blocks, right, but there is still a mix
between new and old file data in this situation.  I assume e2fsck will
warn about this?

-- 
Ciao, Pascal

-<[ pharao90@tzi.de, netmail 2:241/215.72, home http://cobol.cjb.net/) ]>-
* Re: ReiserFS data corruption in very simple configuration
2001-10-03 20:06 ` Pascal Schmidt
@ 2001-10-04 11:02 ` Stephen C. Tweedie
0 siblings, 0 replies; 24+ messages in thread
From: Stephen C. Tweedie @ 2001-10-04 11:02 UTC (permalink / raw)
To: Pascal Schmidt; +Cc: Stephen C. Tweedie, linux-kernel

Hi,

On Wed, Oct 03, 2001 at 10:06:58PM +0200, Pascal Schmidt wrote:
> On Wed, 3 Oct 2001, Stephen C. Tweedie wrote:
>
> > ext3 with ordered data writes has performance nearly up to the level
> > of the fast-and-loose writeback mode for most workloads, and still
> > avoids ever exposing stale disk blocks after a crash.

> What if the machine crashes with parts of the data blocks written to
> disk, before the commit block gets submitted to the drive?

In most cases, users write data by extending off the end of a file.
Only in a few cases (such as databases) do you ever write into the
middle of an existing file.  Even overwriting an existing file is done
by first truncating the file and then extending it again.

If you crash during such an extend, then the data blocks may have been
partially written, but the extend will not have been, so the
incompletely-written data blocks will not be part of any file.

The *only* way to get mis-ordered data blocks in ordered mode after a
crash is if you are overwriting in the middle of an existing file.  In
such a case there is no absolute guarantee about write ordering unless
you use fsync() or O_SYNC to force writes in a particular order.

In journaled data mode, even mid-file overwrites will be strictly
ordered after a crash.

Cheers,
 Stephen
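For the mid-file overwrite case Stephen singles out, an application that
cares about ordering can place fsync() between the writes. The sketch below
is illustrative only and assumes a POSIX system (os.pwrite is not available
on Windows); the file name records.db and the two 8-byte "records" are
invented for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "records.db")
with open(path, "wb") as f:
    f.write(b"A" * 8 + b"B" * 8)    # two 8-byte records

fd = os.open(path, os.O_RDWR)
try:
    os.pwrite(fd, b"X" * 8, 0)      # overwrite record 0 in place
    os.fsync(fd)                    # request that it reach disk before continuing
    os.pwrite(fd, b"Y" * 8, 8)      # only now overwrite record 1
    os.fsync(fd)
finally:
    os.close(fd)

with open(path, "rb") as f:
    contents = f.read()
```

Opening the file with os.O_SYNC instead would make each write synchronous
without the explicit fsync() calls, at the cost of every write waiting for
the disk.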