* document ext3 requirements
@ 2009-01-03 12:38 Pavel Machek
2009-01-03 21:17 ` Martin MOKREJŠ
` (3 more replies)
0 siblings, 4 replies; 67+ messages in thread
From: Pavel Machek @ 2009-01-03 12:38 UTC (permalink / raw)
To: kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap,
linux-doc
Using ext3 is only safe if the storage subsystem meets certain
criteria. Document those.
errors=remount-ro is documented as the default, but the superblock setting
overrides that and mkfs defaults to errors=continue, so the default
is errors=continue in practice.
A read-only mount does actually write to the media in some cases. Document that.
Signed-off-by: Pavel Machek <pavel@suse.cz>
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index 9dd2a3b..74a73b0 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -14,6 +14,9 @@ Options
When mounting an ext3 filesystem, the following option are accepted:
(*) == default
+ro Note that ext3 will replay the journal (and thus write
+ to the partition) even when mounted "read only".
+
journal=update Update the ext3 file system's journal to the current
format.
@@ -95,6 +98,8 @@ debug Extra debugging information is sent to syslog.
errors=remount-ro(*) Remount the filesystem read-only on an error.
errors=continue Keep going on a filesystem error.
errors=panic Panic and halt the machine if an error occurs.
+ (Note that the default is overridden by the
+ superblock setting on most systems.)
data_err=ignore(*) Just print an error message if an error occurs
in a file data buffer in ordered mode.
@@ -188,6 +193,34 @@ mke2fs: create a ext3 partition with the -j flag.
debugfs: ext2 and ext3 file system debugger.
ext2online: online (mounted) ext2 and ext3 filesystem resizer
+Requirements
+============
+
+Ext3 expects the disk/storage subsystem to behave sanely. On a sanely
+behaving disk subsystem, data that has been successfully synced will
+stay on the disk. Sane means:
+
+* writes to media never fail. Even if the disk returns an error condition
+ during a write, ext3 can't handle that correctly, because success on fsync
+ was already returned when the data hit the journal.
+
+ (Fortunately, failing writes are very uncommon on disks, as they
+ have spare sectors that they use when a write fails.)
+
+* either the whole sector is correctly written or nothing is written during
+ a power failure.
+
+ (Unfortunately, none of the cheap USB/SD flash cards I have seen behave
+ like this, so they are unsuitable for ext3. Because RAM tends to fail
+ faster than the rest of the system during a power failure, special
+ hardware that kills DMA transfers may be necessary. I am not sure how
+ common that problem is on generic PC machines.)
+
+* either write caching is disabled, or hw can do barriers and they are enabled.
+
+ (Note that barriers are disabled by default; use the "barrier=1"
+ mount option after making sure the hw can support them.)
+
References
==========
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
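The errors= and barrier points in the patch above can be checked on a real system with standard e2fsprogs and mount tools. What follows is only a sketch under assumptions: the function names, the DRY_RUN variable, and the /dev/XXX placeholder are made up for illustration, and it assumes `dumpe2fs -h` prints an "Errors behavior:" line as current e2fsprogs does.

```shell
#!/bin/sh
# Sketch only: function names, DRY_RUN and device paths are illustrative.
# With DRY_RUN set to a non-empty value, commands are printed, not run.

# Read `dumpe2fs -h` output on stdin and print the superblock's
# error behaviour (the value that overrides the documented default).
errors_behavior() {
    sed -n 's/^Errors behavior:[[:space:]]*//p'
}

# Store errors=remount-ro as the default in the superblock of device $1.
set_errors_remount_ro() {
    ${DRY_RUN:+echo} tune2fs -e remount-ro "$1"
}

# Mount device $1 on $2 with barriers enabled. Do this only after
# confirming that the hardware actually honors barriers.
mount_with_barriers() {
    ${DRY_RUN:+echo} mount -t ext3 -o barrier=1 "$1" "$2"
}
```

Typical use would be `dumpe2fs -h /dev/XXX 2>/dev/null | errors_behavior` to see what the superblock says, followed by `set_errors_remount_ro /dev/XXX` if it reports "Continue" and that is not what you want.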
* Re: document ext3 requirements
From: Martin MOKREJŠ @ 2009-01-03 21:17 UTC
To: Pavel Machek
Cc: kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap, linux-doc

Can one avoid replay of the journal then if it would be unclean?
Just curious.
M.

Pavel Machek wrote:
> Using ext3 is only safe if storage subsystem meets certain
> criteria. Document those.
>
> Errors=remount-ro is documented as default, but superblock setting
> overrides that and mkfs defaults to errors=continue... so the default
> is errors=continue in practice.
>
> readonly mount does actually write to the media in some cases. Document that.
>
> Signed-off-by: Pavel Machek <pavel@suse.cz>
>
> diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
* Re: document ext3 requirements
From: Pavel Machek @ 2009-01-03 22:06 UTC
To: Martin MOKREJŠ
Cc: kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap, linux-doc

On Sat 2009-01-03 22:17:11, Martin MOKREJŠ wrote:
> Can one avoid replay of the journal then if it would be unclean?
> Just curious.

Well, mounting an unclean filesystem is dangerous, but depending on
circumstances, it may be better than writing to the filesystem.

(You may not be able to read some data and may provoke kernel bugs,
but at least you don't damage what is on disk. If you are collecting
evidence -- not writing is very important. If you suspect something is
very wrong with the drive, not writing is a good idea.)

Pavel

>
> Pavel Machek wrote:
> > Using ext3 is only safe if storage subsystem meets certain
> > criteria. Document those.
> >
> > Errors=remount-ro is documented as default, but superblock setting
> > overrides that and mkfs defaults to errors=continue... so the default
> > is errors=continue in practice.
> >
> > readonly mount does actually write to the media in some cases. Document that.
> >
> > Signed-off-by: Pavel Machek <pavel@suse.cz>
> >
> > diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: document ext3 requirements
From: Duane Griffin @ 2009-01-03 22:17 UTC
To: Martin MOKREJŠ
Cc: Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap, linux-doc

[Fixed top-posting]

2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
> Pavel Machek wrote:
>> readonly mount does actually write to the media in some cases. Document that.
>>
> Can one avoid replay of the journal then if it would be unclean?
> Just curious.

Nope. If the underlying block device is read-only then mounting the
filesystem will fail. I tried to fix this some time ago, and have a
set of patches that almost always work, but "almost always" isn't good
enough. Unfortunately I never managed to figure out a way to finish it
off without disgusting hacks or major surgery.

> M.

Cheers,
Duane.

--
"I never could learn to drink that blood and call it wine" - Bob Dylan
* Re: document ext3 requirements
From: Pavel Machek @ 2009-01-03 22:29 UTC
To: Duane Griffin
Cc: Martin MOKREJŠ, kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap, linux-doc

On Sat 2009-01-03 22:17:15, Duane Griffin wrote:
> [Fixed top-posting]
>
> 2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
> > Pavel Machek wrote:
> >> readonly mount does actually write to the media in some cases. Document that.
> >>
> > Can one avoid replay of the journal then if it would be unclean?
> > Just curious.
>
> Nope. If the underlying block device is read-only then mounting the
> filesystem will fail. I tried to fix this some time ago, and have a
> set of patches that almost always work, but "almost always" isn't good
> enough. Unfortunately I never managed to figure out a way to finish it
> off without disgusting hacks or major surgery.

Uhuh, can you just ignore the journal and mount it anyway?
...basically treating it like an ext2?

...ok, that will present "old" version of the filesystem to the
user... violating fsync() semantics.

Still handy for recovering badly broken filesystems, I'd say.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: document ext3 requirements
From: Martin MOKREJŠ @ 2009-01-03 23:01 UTC
To: Pavel Machek
Cc: Duane Griffin, kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap, linux-doc

Pavel Machek wrote:
> On Sat 2009-01-03 22:17:15, Duane Griffin wrote:
>> [Fixed top-posting]
>>
>> 2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
>>> Pavel Machek wrote:
>>>> readonly mount does actually write to the media in some cases. Document that.
>>>>
>>> Can one avoid replay of the journal then if it would be unclean?
>>> Just curious.
>> Nope. If the underlying block device is read-only then mounting the
>> filesystem will fail. I tried to fix this some time ago, and have a
>> set of patches that almost always work, but "almost always" isn't good
>> enough. Unfortunately I never managed to figure out a way to finish it
>> off without disgusting hacks or major surgery.
>
> Uhuh, can you just ignore the journal and mount it anyway?
> ...basically treating it like an ext2?
>
> ...ok, that will present "old" version of the filesystem to the
> user... violating fsync() semantics.

Hmm, so if my dual-boot machine does not shutdown correctly and I boot
accidentally in M$ Win where I use ext2 IFS driver and modify some
stuff on the ext3 drive, after a while reboot to linux and the journal
get re-played ... Mmm ...

>
> Still handy for recovering badly broken filesystems, I'd say.

Me as well. How about improving your doc patch with some summary of this
thread (although it is probably not over yet)? ;-) Definitely, a note
that one can mount it as ext2 while read-only would be helpful when
doing some forensics on the disk.
* Re: document ext3 requirements
From: Duane Griffin @ 2009-01-03 23:38 UTC
To: Martin MOKREJŠ
Cc: Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap, linux-doc

2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
> Hmm, so if my dual-boot machine does not shutdown correctly and I boot
> accidentally in M$ Win where I use ext2 IFS driver and modify some
> stuff on the ext3 drive, after a while reboot to linux and the journal
> get re-played ... Mmm ...

You *really* wouldn't want to be doing that.

The other scenario that people have reported trouble with is
suspending the system, booting a live CD which "read-only" mounts the
filesystem (and replays the journal), then resuming.

Cheers,
Duane.

--
"I never could learn to drink that blood and call it wine" - Bob Dylan
* Re: document ext3 requirements
From: Martin MOKREJŠ @ 2009-01-03 23:50 UTC
To: Duane Griffin
Cc: Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap, linux-doc

Duane Griffin wrote:
> 2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
>> Hmm, so if my dual-boot machine does not shutdown correctly and I boot
>> accidentally in M$ Win where I use ext2 IFS driver and modify some
>> stuff on the ext3 drive, after a while reboot to linux and the journal
>> get re-played ... Mmm ...
>
> You *really* wouldn't want to be doing that.
>
> The other scenario that people have reported trouble with is
> suspending the system, booting a live CD which "read-only" mounts the
> filesystem (and replays the journal), then resuming.

Why does not "mount -ro" die when it would have to replay the journal
with a message that user must run fsck.ext3 in order to be able to mount
it albeit read-only? Still I would prefer having an extra switch to
force mount RO while not touching the journal for disk forensics.
I think that would also prevent the cases when a LiveCD/rescue distribution
would not mount+replay it automagically but user would really have to
provide the switch to the command. I am really not using the recovery
boot cd to touch my partitions in some cases unwillingly.

Sure that does not prevent my case when I let ext2 IFS writing onto
my ext3 partition. Actually, couldn't the driver at least warn me
the journal log is non-empty (am just a user, sorry, cannot check
myself the code at www.fs-driver.org if it could do at least this
although it does not understand ext3). ;-)

Martin
* Re: document ext3 requirements
From: Robert Hancock @ 2009-01-03 23:58 UTC
To: Martin MOKREJŠ
Cc: Duane Griffin, Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap, linux-doc

Martin MOKREJŠ wrote:
> Duane Griffin wrote:
>> 2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
>>> Hmm, so if my dual-boot machine does not shutdown correctly and I boot
>>> accidentally in M$ Win where I use ext2 IFS driver and modify some
>>> stuff on the ext3 drive, after a while reboot to linux and the journal
>>> get re-played ... Mmm ...
>> You *really* wouldn't want to be doing that.
>>
>> The other scenario that people have reported trouble with is
>> suspending the system, booting a live CD which "read-only" mounts the
>> filesystem (and replays the journal), then resuming.
>
> Why does not "mount -ro" die when it would have to replay the journal
> with a message that user must run fsck.ext3 in order to be able to mount
> it albeit read-only? Still I would prefer having an extra switch to

That would break typical system bootup in the unclean journal case,
normally the root FS is mounted read-only to start with (which replays
the journal) and remounted read-write later on - and usually the fsck
utilities are located on the root filesystem..

> force mount RO while not touching the journal for disk forensics.
> I think that would also prevent the cases when a LiveCD/rescue distribution
> would not mount+replay it automagically but user would really have to
> provide the switch to the command. I am really not using the recovery
> boot cd to touch my partitions in some cases unwillingly.

I agree, there should be a way to force it to mount "really read only"
so it doesn't try to replay the journal. That might require just
ignoring the journal content, which may result in the FS appearing
corrupt, but for recovery/forensics purposes that seems better than
nothing..
* Re: document ext3 requirements
From: Martin MOKREJŠ @ 2009-01-04 0:08 UTC
To: Robert Hancock
Cc: Duane Griffin, Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap, linux-doc

Robert Hancock wrote:
> Martin MOKREJŠ wrote:
>> Duane Griffin wrote:
>>> 2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
>>>> Hmm, so if my dual-boot machine does not shutdown correctly and I boot
>>>> accidentally in M$ Win where I use ext2 IFS driver and modify some
>>>> stuff on the ext3 drive, after a while reboot to linux and the journal
>>>> get re-played ... Mmm ...
>>> You *really* wouldn't want to be doing that.
>>>
>>> The other scenario that people have reported trouble with is
>>> suspending the system, booting a live CD which "read-only" mounts the
>>> filesystem (and replays the journal), then resuming.
>>
>> Why does not "mount -ro" die when it would have to replay the journal
>> with a message that user must run fsck.ext3 in order to be able to mount
>> it albeit read-only? Still I would prefer having an extra switch to
>
> That would break typical system bootup in the unclean journal case,
> normally the root FS is mounted read-only to start with (which replays
> the journal) and remounted read-write later on - and usually the fsck
> utilities are located on the root filesystem..

Couldn't that be handled by e.g. OpenRC during boot, by passing, say,
a --force-journal-replay switch during "normal" boot? Yes, that would
mean e2fsprogs would become incompatible with older versions, but why
not "fix" the logic?

>
>> force mount RO while not touching the journal for disk forensics. I
>> think that would also prevent the cases when a LiveCD/rescue
>> distribution would not mount+replay it automagically but user would
>> really have to provide the switch to the command. I am really not
>> using the recovery boot cd to touch my partitions in some cases
>> unwillingly.
>
> I agree, there should be a way to force it to mount "really read only"
> so it doesn't try to replay the journal. That might require just
> ignoring the journal content, which may result in the FS appearing
> corrupt, but for recovery/forensics purposes that seems better than
> nothing..

Fully agree.
M.
* Re: document ext3 requirements
From: Ingo Oeser @ 2009-01-04 21:49 UTC
To: Robert Hancock
Cc: Martin MOKREJŠ, Duane Griffin, Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap, linux-doc

On Sunday 04 January 2009, Robert Hancock wrote:
> I agree, there should be a way to force it to mount "really read only"
> so it doesn't try to replay the journal. That might require just
> ignoring the journal content, which may result in the FS appearing
> corrupt, but for recovery/forensics purposes that seems better than
> nothing..

For forensics you ALWAYS get a copy of the full disk first, which
you set read only with blockdev --setro /dev/$MYDISK.

You then restore from this copy.

Best regards

Ingo Oeser, been there, done that
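One possible reading of the procedure described above, as a sketch: mark the source disk read-only at the block layer, then image it and work only on the image. The function names, the DRY_RUN variable, and the paths are assumptions invented for this example; only `blockdev --setro` comes from the message itself.

```shell
#!/bin/sh
# Sketch only: names, DRY_RUN and paths are illustrative placeholders.
# With DRY_RUN set to a non-empty value, commands are printed, not run.

# Mark block device $1 read-only at the block layer, so that even a
# journal replay attempt cannot write to it.
set_device_ro() {
    ${DRY_RUN:+echo} blockdev --setro "$1"
}

# Copy the whole device $1 into image file $2; all further analysis
# should then happen on the image, not on the original disk.
image_device() {
    ${DRY_RUN:+echo} dd if="$1" of="$2" bs=1M
}
```

For example, `set_device_ro /dev/XXX && image_device /dev/XXX /some/other/disk/XXX.img`, where the image lives on a different disk with enough free space.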
* Re: document ext3 requirements
From: Duane Griffin @ 2009-01-04 0:00 UTC
To: Martin MOKREJŠ
Cc: Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap, linux-doc

2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
> Why does not "mount -ro" die when it would have to replay the journal
> with a message that user must run fsck.ext3 in order to be able to mount
> it albeit read-only? Still I would prefer having an extra switch to
> force mount RO while not touching the journal for disk forensics.
> I think that would also prevent the cases when a LiveCD/rescue distribution
> would not mount+replay it automagically but user would really have to
> provide the switch to the command. I am really not using the recovery
> boot cd to touch my partitions in some cases unwillingly.

Well, that would make things rather tricky. As in, shutting down
uncleanly would render your system unbootable.

> Sure that does not prevent my case when I let ext2 IFS writing onto
> my ext3 partition. Actually, couldn't the driver at least warn me
> the journal log is non-empty (am just a user, sorry, cannot check
> myself the code at www.fs-driver.org if it could do at least this
> although it does not understand ext3). ;-)

The driver certainly should warn you in that case. I have no idea
whether it does, as I don't use it, sorry.

Cheers,
Duane.

--
"I never could learn to drink that blood and call it wine" - Bob Dylan
* Re: document ext3 requirements
From: Martin MOKREJŠ @ 2009-01-04 0:11 UTC
To: Duane Griffin
Cc: Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap, linux-doc

Duane Griffin wrote:
> 2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
>> Why does not "mount -ro" die when it would have to replay the journal
>> with a message that user must run fsck.ext3 in order to be able to mount
>> it albeit read-only? Still I would prefer having an extra switch to
>> force mount RO while not touching the journal for disk forensics.
>> I think that would also prevent the cases when a LiveCD/rescue distribution
>> would not mount+replay it automagically but user would really have to
>> provide the switch to the command. I am really not using the recovery
>> boot cd to touch my partitions in some cases unwillingly.
>
> Well, that would make things rather tricky. As in, shutting down
> uncleanly would render your system unbootable.

??? If I am booted off a CD/DVD drive I just do not want my system
to be touched. I am fine if the dist mounts my drives automagically
in read-only mode but if that currently forces journal replay then no,
thanks. ;)
M.
* Re: document ext3 requirements
From: Duane Griffin @ 2009-01-04 0:41 UTC
To: Martin MOKREJŠ
Cc: Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap, linux-doc

2009/1/4 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
> Duane Griffin wrote:
>> 2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
>>> Why does not "mount -ro" die when it would have to replay the journal
>>> with a message that user must run fsck.ext3 in order to be able to mount
>>> it albeit read-only? Still I would prefer having an extra switch to
>>> force mount RO while not touching the journal for disk forensics.
>>> I think that would also prevent the cases when a LiveCD/rescue distribution
>>> would not mount+replay it automagically but user would really have to
>>> provide the switch to the command. I am really not using the recovery
>>> boot cd to touch my partitions in some cases unwillingly.
>>
>> Well, that would make things rather tricky. As in, shutting down
>> uncleanly would render your system unbootable.
>
> ??? If I am booted off a CD/DVD drive I just do not want my system
> to be touched. I am fine if the dist mounts my drives automagically
> in read-only mode but if that currently forces journal replay then no,
> thanks. ;)

I agree, it isn't a great situation. Nonetheless, it has always been
thus for ext3, and so far we've muddled along. Unless and until we can
replay the journal in-memory without touching the on-disk data, we are
stuck with it.

We can't refuse to mount an unclean FS, as that would break booting.
We also can't ignore the journal by default, if/when we get a patch to
do so at all, as that effectively corrupts random chunks of the FS.
Fine for forensics and recovery; not so much for booting from.

> M.

Cheers,
Duane.

--
"I never could learn to drink that blood and call it wine" - Bob Dylan
* Re: document ext3 requirements
From: Valdis.Kletnieks @ 2009-01-04 3:52 UTC
To: Duane Griffin
Cc: Martin MOKREJŠ, Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap, linux-doc

On Sun, 04 Jan 2009 00:41:51 GMT, Duane Griffin said:

> I agree, it isn't a great situation. Nonetheless, it has always been
> thus for ext3, and so far we've muddled along. Unless and until we can
> replay the journal in-memory without touching the on-disk data, we are
> stuck with it.

Is there a way using md/dm/lvm etc to make the source partition R/O and
replay the journal onto a CoW snapshot? Admittedly, not easy to do inside
the 'mount' command itself, but at least it might be workable for LiveCD R/O
mounts and forensics work, where you can *tell* beforehand that's what you
want and can jump through setup games before doing the mount...
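To make the CoW idea above concrete, here is a rough sketch using the device-mapper "snapshot" target. The function names, the DRY_RUN variable, the chunk size of 8 sectors, and all device names are assumptions chosen for illustration; the CoW device could be, for instance, a loop device backed by a file in RAM. Whether this suits a given LiveCD setup is untested here.

```shell
#!/bin/sh
# Sketch only: names, DRY_RUN, chunk size and devices are illustrative.

# Print a device-mapper table line for a non-persistent ("N") snapshot.
#   $1 = origin size in 512-byte sectors (e.g. from `blockdev --getsz`)
#   $2 = origin device (left untouched), $3 = device receiving CoW data
snapshot_table() {
    printf '0 %s snapshot %s %s N 8\n' "$1" "$2" "$3"
}

# Create /dev/mapper/$3 as a snapshot of origin $1 with CoW store $2.
# Writes to the snapshot, including journal replay on mount, are meant
# to land in the CoW store rather than on the origin.
create_snapshot() {
    sectors=$(blockdev --getsz "$1") || return 1
    ${DRY_RUN:+echo} dmsetup create "$3" \
        --table "$(snapshot_table "$sectors" "$1" "$2")"
}
```

After `create_snapshot /dev/XXX /dev/loopN ro_view`, one would mount /dev/mapper/ro_view instead of /dev/XXX. Combining this with `blockdev --setro` on the origin gives a second line of defence.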
* Re: document ext3 requirements
From: Duane Griffin @ 2009-01-04 14:24 UTC
To: Valdis.Kletnieks
Cc: Martin MOKREJŠ, Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap, linux-doc

2009/1/4 <Valdis.Kletnieks@vt.edu>:
> On Sun, 04 Jan 2009 00:41:51 GMT, Duane Griffin said:
>
>> I agree, it isn't a great situation. Nonetheless, it has always been
>> thus for ext3, and so far we've muddled along. Unless and until we can
>> replay the journal in-memory without touching the on-disk data, we are
>> stuck with it.
>
> Is there a way using md/dm/lvm etc to make the source partition R/O and
> replay the journal onto a CoW snapshot? Admittedly, not easy to do inside
> the 'mount' command itself, but at least it might be workable for LiveCD R/O
> mounts and forensics work, where you can *tell* beforehand that's what you
> want and can jump through setup games before doing the mount...

Yes, something like that is best practice, as I understand it. The
LiveCD init scripts could check whether they are about to R/O mount an
ext[34] filesystem needing recovery and either refuse with a useful
message to the user, or even automatically create and mount a COW
snapshot, as you described. They'd still need to warn the user though,
since things like remounting R/W wouldn't work as expected.

Cheers,
Duane.

--
"I never could learn to drink that blood and call it wine" - Bob Dylan
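The check that such an init script would need can be sketched as follows. It assumes that `dumpe2fs -h` lists the "needs_recovery" feature flag on its "Filesystem features:" line when the journal has not been replayed; the function name and the /dev/XXX placeholder are made up for this example.

```shell
#!/bin/sh
# Sketch only: the function name and device path are illustrative.

# Read `dumpe2fs -h` output on stdin. Exit status 0 means the
# filesystem carries the needs_recovery flag, i.e. even a read-only
# mount would replay the journal and write to the device.
needs_recovery() {
    grep '^Filesystem features:' | grep -qw 'needs_recovery'
}
```

An init script could then do something like `if dumpe2fs -h /dev/XXX 2>/dev/null | needs_recovery; then echo "journal needs recovery, not mounting /dev/XXX"; fi` before any automatic read-only mount.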
* Re: document ext3 requirements
From: Theodore Tso @ 2009-01-04 18:40 UTC
To: Duane Griffin
Cc: Valdis.Kletnieks, Martin MOKREJŠ, Pavel Machek, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc

On Sun, Jan 04, 2009 at 02:24:43PM +0000, Duane Griffin wrote:
> > Is there a way using md/dm/lvm etc to make the source partition R/O and
> > replay the journal onto a CoW snapshot? Admittedly, not easy to do inside
> > the 'mount' command itself, but at least it might be workable for LiveCD R/O
> > mounts and forensics work, where you can *tell* beforehand that's what you
> > want and can jump through setup games before doing the mount...
>
> Yes, something like that is best practice, as I understand it. The
> LiveCD init scripts could check whether they are about to R/O mount an
> ext[34] filesystem needing recovery and either refuse with a useful
> message to the user, or even automatically create and mount a COW
> snapshot, as you described. They'd still need to warn the user though,
> since things like remounting R/W wouldn't work as expected.

So what's the use case where people want to be able to mount a
filesystem needing recovery read/only without running the journal?

- Ted
* Re: document ext3 requirements
From: Geert Uytterhoeven @ 2009-01-04 19:21 UTC
To: Theodore Tso
Cc: Duane Griffin, Valdis.Kletnieks, Martin MOKREJŠ, Pavel Machek, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc

On Sun, 4 Jan 2009, Theodore Tso wrote:
> On Sun, Jan 04, 2009 at 02:24:43PM +0000, Duane Griffin wrote:
> > > Is there a way using md/dm/lvm etc to make the source partition R/O and
> > > replay the journal onto a CoW snapshot? Admittedly, not easy to do inside
> > > the 'mount' command itself, but at least it might be workable for LiveCD R/O
> > > mounts and forensics work, where you can *tell* beforehand that's what you
> > > want and can jump through setup games before doing the mount...
> >
> > Yes, something like that is best practice, as I understand it. The
> > LiveCD init scripts could check whether they are about to R/O mount an
> > ext[34] filesystem needing recovery and either refuse with a useful
> > message to the user, or even automatically create and mount a COW
> > snapshot, as you described. They'd still need to warn the user though,
> > since things like remounting R/W wouldn't work as expected.
>
> So what's the use case where people want to be able to mount a
> filesystem needing recovery read/only without running the journal?

As mentioned before, suspending a laptop (running from hdd), running a
live CD, and expecting everything to work fine when resuming from hdd?

I think most people get shocked when they discover that mounting
something read-only may actually write to the media. This is a bit
unexpected (hey, if I mount `read-only', I expect that no writes will
happen), as it behaved differently before the introduction of
journalling.

As for mounting the root file system read-only during early boot up, and
remounting it read-write later, I guess it's quite complicated to replay the
journal (in RAM) on read-only mount, and deferring the replay writeback until
remounting read-write?

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds
* Re: document ext3 requirements
From: Theodore Tso @ 2009-01-04 19:36 UTC
To: Geert Uytterhoeven
Cc: Duane Griffin, Valdis.Kletnieks, Martin MOKREJŠ, Pavel Machek, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc

On Sun, Jan 04, 2009 at 08:21:06PM +0100, Geert Uytterhoeven wrote:
> As mentioned before, suspending a laptop (running from hdd), running
> a live CD, and expecting everything to work fine when resuming from
> hdd?
>
> I think most people get shocked when they discover that mounting
> something read-only may actually write to the media. This is a bit
> unexpected (hey, if I mount `read-only', I expect that no writes
> will happen), as it behaved differently before the introduction of
> journalling.

It's been this way for about a decade.... that being said, if you
really want to do this, you can today via "mount -o ro,noload /dev/XXX
/mntpt". However, the system could crash or fail, because a filesystem
whose journal has not been run could be quite inconsistent.

> As for mounting the root file system read-only during early boot up, and
> remounting it read-write later, I guess it's quite complicated to replay the
> journal (in RAM) on read-only mount, and deferring the replay writeback until
> remounting read-write?

It's not *that* hard; if someone would like to cons up a patch, please
feel free.... but it's certainly not a high priority for me or most
of the other ext3 filesystem developers.

- Ted
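The noload mount mentioned above can be wrapped in a small helper for rescue or forensics scripts. This is only a sketch: the function name and the DRY_RUN variable are invented for the example, while the mount options are the ones quoted in the message. The caveat from the message still applies, since the filesystem may look inconsistent when the journal is skipped.

```shell
#!/bin/sh
# Sketch only: function name and DRY_RUN are illustrative.
# With DRY_RUN set to a non-empty value, the command is printed, not run.

# Mount device $1 on mount point $2 read-only WITHOUT loading (and
# therefore without replaying) the journal.
mount_ro_noload() {
    ${DRY_RUN:+echo} mount -o ro,noload "$1" "$2"
}
```

Running `DRY_RUN=1; mount_ro_noload /dev/XXX /mntpt` first is a cheap way to see exactly what would be executed before touching a suspect disk.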
* Re: document ext3 requirements 2009-01-04 19:36 ` Theodore Tso @ 2009-01-04 19:51 ` Duane Griffin 2009-01-04 21:55 ` Theodore Tso 0 siblings, 1 reply; 67+ messages in thread From: Duane Griffin @ 2009-01-04 19:51 UTC (permalink / raw) To: Theodore Tso, Geert Uytterhoeven, Duane Griffin, Valdis.Kletnieks, Martin MOKREJŠ, Pavel Machek, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc 2009/1/4 Theodore Tso <tytso@mit.edu>: > On Sun, Jan 04, 2009 at 08:21:06PM +0100, Geert Uytterhoeven wrote: >> As for mounting the root file system read-only during early boot up, and >> remounting it read-write later, I guess it's quite complicated to replay the >> journal (in RAM) on read-only mount, and deferring the replay writeback until >> remounting read-write? > > It's not *that* hard; if someone would like to cons up a patch, please > feel free.... but it's certainly not a high priority for me or most > of the other ext3 filesystem developers. If anyone is interested I'd be happy to dust off and send them my old patches to implement this. There are a couple of issues with it. First, I never got around to implementing remount R/W support. Second, I had to introduce a rather nasty hack in order to handle un-escaping JFS magic numbers. > - Ted Cheers, Duane. -- "I never could learn to drink that blood and call it wine" - Bob Dylan ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: document ext3 requirements 2009-01-04 19:51 ` Duane Griffin @ 2009-01-04 21:55 ` Theodore Tso 2009-01-04 22:06 ` Duane Griffin 0 siblings, 1 reply; 67+ messages in thread From: Theodore Tso @ 2009-01-04 21:55 UTC (permalink / raw) To: Duane Griffin Cc: Geert Uytterhoeven, Valdis.Kletnieks, Martin MOKREJŠ, Pavel Machek, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc On Sun, Jan 04, 2009 at 07:51:27PM +0000, Duane Griffin wrote: > > If anyone is interested I'd be happy to dust off and send them my old > patches to implement this. There are a couple of issues with it. > First, I never got around to implementing remount R/W support. Second, > I had to introduce a rather nasty hack in order to handle un-escaping > JFS magic numbers. Can you dust off the patches and send a copy to linux-ext4@vger.kernel.org so we have them archived someplace where hopefully someone might have time to look at it? - Ted ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: document ext3 requirements 2009-01-04 21:55 ` Theodore Tso @ 2009-01-04 22:06 ` Duane Griffin 0 siblings, 0 replies; 67+ messages in thread From: Duane Griffin @ 2009-01-04 22:06 UTC (permalink / raw) To: Theodore Tso, Duane Griffin, Geert Uytterhoeven, Valdis.Kletnieks, Martin MOKREJŠ, Pavel Machek, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc 2009/1/4 Theodore Tso <tytso@mit.edu>: > On Sun, Jan 04, 2009 at 07:51:27PM +0000, Duane Griffin wrote: >> >> If anyone is interested I'd be happy to dust off and send them my old >> patches to implement this. There are a couple of issues with it. >> First, I never got around to implementing remount R/W support. Second, >> I had to introduce a rather nasty hack in order to handle un-escaping >> JFS magic numbers. > > Can you dust off the patches and send a copy to > linux-ext4@vger.kernel.org so we have them archived someplace where > hopefully someone might have time to look at it? OK, will do. I've posted them there before, but not the latest version that properly handles un-escaping JFS magic numbers (albeit in an ugly way). I'll rebase them on top of the latest ext4 patch queue and repost. > - Ted Cheers, Duane. -- "I never could learn to drink that blood and call it wine" - Bob Dylan ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: document ext3 requirements
2009-01-04 19:21 ` Geert Uytterhoeven
2009-01-04 19:36 ` Theodore Tso
@ 2009-01-04 22:42 ` Bron Gondwana
2009-01-05 3:22 ` Rob Landley
2 siblings, 0 replies; 67+ messages in thread
From: Bron Gondwana @ 2009-01-04 22:42 UTC (permalink / raw)
To: Geert Uytterhoeven
Cc: Theodore Tso, Duane Griffin, Valdis.Kletnieks, Martin MOKREJŠ,
Pavel Machek, kernel list, Andrew Morton, mtk.manpages, rdunlap,
linux-doc

On Sun, Jan 04, 2009 at 08:21:06PM +0100, Geert Uytterhoeven wrote:
> On Sun, 4 Jan 2009, Theodore Tso wrote:
> > On Sun, Jan 04, 2009 at 02:24:43PM +0000, Duane Griffin wrote:
> > > > Is there a way using md/dm/lvm etc to make the source partition R/O and
> > > > replay the journal onto a CoW snapshot? Admittedly, not easy to do inside
> > > > the 'mount' command itself, but at least it might be workable for LiveCD R/O
> > > > mounts and forensics work, where you can *tell* beforehand that's what you
> > > > want and can jump through setup games before doing the mount...
> > >
> > > Yes, something like that is best practice, as I understand it. The
> > > LiveCD init scripts could check whether they are about to R/O mount an
> > > ext[34] filesystem needing recovery and either refuse with a useful
> > > message to the user, or even automatically create and mount a COW
> > > snapshot, as you described. They'd still need to warn the user though,
> > > since things like remounting R/W wouldn't work as expected.
> >
> > So what's the use case where people want to be able to mount a
> > filesystem needing recovery read/only without running the journal?
>
> As mentioned before, suspending a laptop (running from hdd), running a live CD,
> and expecting everything to work fine when resuming from hdd?

Any particular reason why suspend doesn't run the journal during
shutdown and leave a clean filesystem? It shouldn't take that long
surely.

I know it doesn't solve the "it really just crashed" problem, but you
don't tend to unsuspend from a crash anyway.

Bron ( just curious )

^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: document ext3 requirements
2009-01-04 19:21 ` Geert Uytterhoeven
2009-01-04 19:36 ` Theodore Tso
2009-01-04 22:42 ` Bron Gondwana
@ 2009-01-05 3:22 ` Rob Landley
2 siblings, 0 replies; 67+ messages in thread
From: Rob Landley @ 2009-01-05 3:22 UTC (permalink / raw)
To: Geert Uytterhoeven
Cc: Theodore Tso, Duane Griffin, Valdis.Kletnieks, Martin MOKREJŠ,
Pavel Machek, kernel list, Andrew Morton, mtk.manpages, rdunlap,
linux-doc

On Sunday 04 January 2009 13:21:06 Geert Uytterhoeven wrote:
> I think most people get shocked when they discover that mounting something
> read-only may actually write to the media. This is a bit unexpected (hey, if
> I mount `read-only', I expect that no writes will happen), as it behaved
> differently before the introduction of journalling.

Is this an unreasonable use case:

kill -STOP $(pidof qemu)
mount -o loop,ro hdb.img blah
cp blah/thingy thingy
umount blah
kill -CONT $(pidof qemu)

Currently, if your loopback mount is -t ext3 it'll write to the block
device, and if your mount is -t ext2 it'll refuse to work on an unclean
ext3 filesystem, even if it's read only. (But it _will_ work on an
unclean ext2 filesystem.)

My theory when I first found out about this was "the filesystem
developers hate me personally".

Rob

^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: document ext3 requirements
2009-01-03 23:01 ` Martin MOKREJŠ
2009-01-03 23:38 ` Duane Griffin
@ 2009-01-04 0:19 ` Pavel Machek
2009-01-05 2:55 ` Rob Landley
2009-01-04 19:56 ` Rob Landley
2009-01-06 10:08 ` Matthias Andree
3 siblings, 1 reply; 67+ messages in thread
From: Pavel Machek @ 2009-01-04 0:19 UTC (permalink / raw)
To: Martin MOKREJŠ
Cc: Duane Griffin, kernel list, Andrew Morton, tytso, mtk.manpages,
rdunlap, linux-doc

On Sun 2009-01-04 00:01:58, Martin MOKREJŠ wrote:
> Pavel Machek wrote:
> > On Sat 2009-01-03 22:17:15, Duane Griffin wrote:
> >> [Fixed top-posting]
> >>
> >> 2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
> >>> Pavel Machek wrote:
> >>>> readonly mount does actually write to the media in some cases. Document that.
> >>>>
> >>> Can one avoid replay of the journal then if it would be unclean?
> >>> Just curious.
> >> Nope. If the underlying block device is read-only then mounting the
> >> filesystem will fail. I tried to fix this some time ago, and have a
> >> set of patches that almost always work, but "almost always" isn't good
> >> enough. Unfortunately I never managed to figure out a way to finish it
> >> off without disgusting hacks or major surgery.
> >
> > Uhuh, can you just ignore the journal and mount it anyway?
> > ...basically treating it like an ext2?
> >
> > ...ok, that will present "old" version of the filesystem to the
> > user... violating fsync() semantics.
>
> Hmm, so if my dual-boot machine does not shutdown correctly and I boot
> accidentally in M$ Win where I use ext2 IFS driver and modify some
> stuff on the ext3 drive, after a while reboot to linux and the journal
> get re-played ... Mmm ...

ext2 driver should refuse to mount dirty ext3 filesystem. (Linux ext2
driver does that).

> > Still handy for recovering badly broken filesystems, I'd say.
>
> Me as well. How about improving your doc patch with some summary of
> this thread (although it is probably not over yet)? ;-) Definitely,
> a note that one can mount it as ext2 while read-only would be helpful
> when doing some forensics on the disk.

No, you can't mount unclean ext3 as an ext2; patch to do that would be
possible but...

I believe the patch is correct & useful.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: document ext3 requirements
2009-01-04 0:19 ` Pavel Machek
@ 2009-01-05 2:55 ` Rob Landley
0 siblings, 0 replies; 67+ messages in thread
From: Rob Landley @ 2009-01-05 2:55 UTC (permalink / raw)
To: Pavel Machek
Cc: Martin MOKREJŠ, Duane Griffin, kernel list, Andrew Morton, tytso,
mtk.manpages, rdunlap, linux-doc

On Saturday 03 January 2009 18:19:00 Pavel Machek wrote:
> No, you can't mount unclean ext3 as an ext2; patch to do that would be
> possible but...

tune2fs -O ^has_journal /dev/blah
fsck.ext2 -f /dev/blah

Rob

^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: document ext3 requirements
2009-01-03 23:01 ` Martin MOKREJŠ
2009-01-03 23:38 ` Duane Griffin
2009-01-04 0:19 ` Pavel Machek
@ 2009-01-04 19:56 ` Rob Landley
2009-01-05 19:16 ` Theodore Tso
2009-01-06 10:08 ` Matthias Andree
3 siblings, 1 reply; 67+ messages in thread
From: Rob Landley @ 2009-01-04 19:56 UTC (permalink / raw)
To: Martin MOKREJŠ
Cc: Pavel Machek, Duane Griffin, kernel list, Andrew Morton, tytso,
mtk.manpages, rdunlap, linux-doc

On Saturday 03 January 2009 17:01:58 Martin MOKREJŠ wrote:
> > Still handy for recovering badly broken filesystems, I'd say.
>
> Me as well. How about improving your doc patch with some summary of
> this thread (although it is probably not over yet)? ;-) Definitely,
> a note that one can mount it as ext2 while read-only would be helpful
> when doing some forensics on the disk.

Although make sure you _do_ mount it as read only because if you mount
an ext3 filesystem read/write as ext2 I've had it zap the journal
entirely and then you have to tune2fs -j the sucker to turn it back
into ext3.

Ext3 is... touchy.

Rob

^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: document ext3 requirements
2009-01-04 19:56 ` Rob Landley
@ 2009-01-05 19:16 ` Theodore Tso
2009-01-06 19:20 ` Rob Landley
0 siblings, 1 reply; 67+ messages in thread
From: Theodore Tso @ 2009-01-05 19:16 UTC (permalink / raw)
To: Rob Landley
Cc: Martin MOKREJŠ, Pavel Machek, Duane Griffin, kernel list,
Andrew Morton, mtk.manpages, rdunlap, linux-doc

On Sun, Jan 04, 2009 at 01:56:32PM -0600, Rob Landley wrote:
> On Saturday 03 January 2009 17:01:58 Martin MOKREJŠ wrote:
> > > Still handy for recovering badly broken filesystems, I'd say.
> >
> > Me as well. How about improving your doc patch with some summary of
> > this thread (although it is probably not over yet)? ;-) Definitely,
> > a note that one can mount it as ext2 while read-only would be helpful
> > when doing some forensics on the disk.
>
> Although make sure you _do_ mount it as read only because if you mount an ext3
> filesystem read/write as ext2 I've had it zap the journal entirely and then
> you have to tune2fs -j the sucker to turn it back into ext3.
>
> Ext3 is... touchy.

Um.... horse pucky:

# mke2fs -q -t ext3 /dev/thunk/footest
# debugfs -R features /dev/thunk/footest
debugfs 1.41.3 (12-Oct-2008)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype sparse_super large_file
# mount -t ext2 /dev/thunk/footest /mnt
# touch /mnt/foo
# umount /mnt
# debugfs -R features /dev/thunk/footest
debugfs 1.41.3 (12-Oct-2008)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype sparse_super large_file

- Ted

^ permalink raw reply [flat|nested] 67+ messages in thread
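The before/after comparison in the transcript above can be automated. What follows is a minimal sketch, not a tool from this thread: `parse_features` is a hypothetical helper, and it assumes only the "Filesystem features:" line format that appears in the debugfs output quoted above. The sample text is copied from that transcript.

```python
def parse_features(debugfs_output):
    """Return the set of feature names from 'debugfs -R features' output.

    Assumes the output contains a line of the form
    'Filesystem features: name name ...', as in the transcript above.
    """
    prefix = "Filesystem features:"
    for line in debugfs_output.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            return set(line[len(prefix):].split())
    raise ValueError("no 'Filesystem features:' line found")


# Output captured before and after the ext2 read/write mount in the transcript.
BEFORE = """debugfs 1.41.3 (12-Oct-2008)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype sparse_super large_file
"""
AFTER = BEFORE  # the transcript shows identical output after the mount

# The journal survives if has_journal is present in both sets.
journal_kept = ("has_journal" in parse_features(BEFORE)
                and "has_journal" in parse_features(AFTER))
```

Comparing whole sets rather than only `has_journal` would also catch any other feature flag that changed across the mount.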
* Re: document ext3 requirements
2009-01-05 19:16 ` Theodore Tso
@ 2009-01-06 19:20 ` Rob Landley
0 siblings, 0 replies; 67+ messages in thread
From: Rob Landley @ 2009-01-06 19:20 UTC (permalink / raw)
To: Theodore Tso
Cc: Martin MOKREJŠ, Pavel Machek, Duane Griffin, kernel list,
Andrew Morton, mtk.manpages, rdunlap, linux-doc

On Monday 05 January 2009 13:16:58 Theodore Tso wrote:
> On Sun, Jan 04, 2009 at 01:56:32PM -0600, Rob Landley wrote:
> > On Saturday 03 January 2009 17:01:58 Martin MOKREJŠ wrote:
> > > > Still handy for recovering badly broken filesystems, I'd say.
> > >
> > > Me as well. How about improving your doc patch with some summary of
> > > this thread (although it is probably not over yet)? ;-) Definitely,
> > > a note that one can mount it as ext2 while read-only would be helpful
> > > when doing some forensics on the disk.
> >
> > Although make sure you _do_ mount it as read only because if you mount an
> > ext3 filesystem read/write as ext2 I've had it zap the journal entirely
> > and then you have to tune2fs -j the sucker to turn it back into ext3.
> >
> > Ext3 is... touchy.
>
> Um.... horse pucky:

Well I managed to kill it more than once, but I could easily have the
reproduction sequence wrong. (I wasn't _trying_ to do it again...)

> # mke2fs -q -t ext3 /dev/thunk/footest
> # debugfs -R features /dev/thunk/footest
> debugfs 1.41.3 (12-Oct-2008)
> Filesystem features: has_journal ext_attr resize_inode dir_index filetype sparse_super large_file
> # mount -t ext2 /dev/thunk/footest /mnt
> # touch /mnt/foo
> # umount /mnt
> # debugfs -R features /dev/thunk/footest
> debugfs 1.41.3 (12-Oct-2008)
> Filesystem features: has_journal ext_attr resize_inode dir_index filetype sparse_super large_file

If I can figure out what I did, I'll get back to you.

> - Ted

Rob

^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: document ext3 requirements 2009-01-03 23:01 ` Martin MOKREJŠ ` (2 preceding siblings ...) 2009-01-04 19:56 ` Rob Landley @ 2009-01-06 10:08 ` Matthias Andree 2009-01-06 15:23 ` Theodore Tso 3 siblings, 1 reply; 67+ messages in thread From: Matthias Andree @ 2009-01-06 10:08 UTC (permalink / raw) To: Martin MOKREJŠ; +Cc: Duane Griffin, kernel list On Sun, 04 Jan 2009, Martin MOKREJŠ wrote: > Pavel Machek wrote: > > On Sat 2009-01-03 22:17:15, Duane Griffin wrote: > >> [Fixed top-posting] > >> > >> 2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>: > >>> Pavel Machek wrote: > >>>> readonly mount does actually write to the media in some cases. Document that. > >>>> > >>> Can one avoid replay of the journal then if it would be unclean? > >>> Just curious. > >> Nope. If the underlying block device is read-only then mounting the > >> filesystem will fail. I tried to fix this some time ago, and have a > >> set of patches that almost always work, but "almost always" isn't good > >> enough. Unfortunately I never managed to figure out a way to finish it > >> off without disgusting hacks or major surgery. > > > > Uhuh, can you just ignore the journal and mount it anyway? > > ...basically treating it like an ext2? > > > > ...ok, that will present "old" version of the filesystem to the > > user... violating fsync() semantics. > > Hmm, so if my dual-boot machine does not shutdown correctly and I boot > accidentally in M$ Win where I use ext2 IFS driver and modify some > stuff on the ext3 drive, after a while reboot to linux and the journal > get re-played ... Mmm ... If the ext2 IFS driver mounts an ext3 file system that needs journal replay, the IFS driver is broken (unless it can replay the journal, of course - I stopped using that driver long ago, being unhappy with it). -- Matthias Andree ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: document ext3 requirements 2009-01-06 10:08 ` Matthias Andree @ 2009-01-06 15:23 ` Theodore Tso 0 siblings, 0 replies; 67+ messages in thread From: Theodore Tso @ 2009-01-06 15:23 UTC (permalink / raw) To: Martin MOKREJŠ, Duane Griffin, kernel list On Tue, Jan 06, 2009 at 11:08:10AM +0100, Matthias Andree wrote: > On Sun, 04 Jan 2009, Martin MOKREJŠ wrote: > > Hmm, so if my dual-boot machine does not shutdown correctly and I boot > > accidentally in M$ Win where I use ext2 IFS driver and modify some > > stuff on the ext3 drive, after a while reboot to linux and the journal > > get re-played ... Mmm ... > > If the ext2 IFS driver mounts an ext3 file system that needs journal > replay, the IFS driver is broken (unless it can replay the journal, of > course - I stopped using that driver long ago, being unhappy with it). Indeed; that's why there is a INCOMPAT NEEDS_RECOVERY feature flag to prevent compliant ext2 implementations from mounting an ext3 filesystem that needs recovery. We've thought about most of these issues, almost a decade ago... - Ted ^ permalink raw reply [flat|nested] 67+ messages in thread
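The needs-recovery check described above can be illustrated on a raw image. This is a rough sketch under stated assumptions, not code from the kernel or from e2fsprogs: it assumes the usual ext2/ext3 on-disk layout (primary superblock 1024 bytes into the device, little-endian fields, magic at byte 56, feature words at bytes 92, 96 and 100 of the superblock). The helper names and the synthetic image builder are hypothetical.

```python
import struct

SUPERBLOCK_OFFSET = 1024             # primary superblock starts 1 KiB into the device
EXT2_SUPER_MAGIC = 0xEF53
FEATURE_COMPAT_HAS_JOURNAL = 0x0004  # "has_journal"
FEATURE_INCOMPAT_RECOVER = 0x0004    # "needs_recovery" (EXT3_FEATURE_INCOMPAT_RECOVER)


def read_features(image):
    """Return (compat, incompat, ro_compat) feature words from raw image bytes."""
    sb = image[SUPERBLOCK_OFFSET:SUPERBLOCK_OFFSET + 1024]
    if len(sb) < 104:
        raise ValueError("image too short to hold a superblock")
    (magic,) = struct.unpack_from("<H", sb, 56)
    if magic != EXT2_SUPER_MAGIC:
        raise ValueError("not an ext2/ext3 superblock")
    return struct.unpack_from("<III", sb, 92)


def needs_recovery(image):
    """True if the image has a journal that still needs to be replayed."""
    compat, incompat, _ = read_features(image)
    return bool(compat & FEATURE_COMPAT_HAS_JOURNAL) and bool(
        incompat & FEATURE_INCOMPAT_RECOVER)


def make_fake_image(compat, incompat):
    """Build a synthetic 2 KiB image holding only the fields read above."""
    img = bytearray(2048)
    struct.pack_into("<H", img, SUPERBLOCK_OFFSET + 56, EXT2_SUPER_MAGIC)
    struct.pack_into("<III", img, SUPERBLOCK_OFFSET + 92, compat, incompat, 0)
    return bytes(img)


clean = make_fake_image(FEATURE_COMPAT_HAS_JOURNAL, 0)
dirty = make_fake_image(FEATURE_COMPAT_HAS_JOURNAL, FEATURE_INCOMPAT_RECOVER)
```

An implementation that does not understand a set incompat bit is expected to refuse the mount, which is why the flag lives in the incompat word rather than the compat one.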
* Re: document ext3 requirements 2009-01-03 22:29 ` Pavel Machek 2009-01-03 23:01 ` Martin MOKREJŠ @ 2009-01-03 23:12 ` Duane Griffin 2009-01-06 10:06 ` Matthias Andree 2 siblings, 0 replies; 67+ messages in thread From: Duane Griffin @ 2009-01-03 23:12 UTC (permalink / raw) To: Pavel Machek Cc: Martin MOKREJŠ, kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap, linux-doc 2009/1/3 Pavel Machek <pavel@suse.cz>: > On Sat 2009-01-03 22:17:15, Duane Griffin wrote: >> [Fixed top-posting] >> >> 2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>: >> > Pavel Machek wrote: >> >> readonly mount does actually write to the media in some cases. Document that. >> >> >> > Can one avoid replay of the journal then if it would be unclean? >> > Just curious. >> >> Nope. If the underlying block device is read-only then mounting the >> filesystem will fail. I tried to fix this some time ago, and have a >> set of patches that almost always work, but "almost always" isn't good >> enough. Unfortunately I never managed to figure out a way to finish it >> off without disgusting hacks or major surgery. > > Uhuh, can you just ignore the journal and mount it anyway? > ...basically treating it like an ext2? I'm afraid not, ext2 won't mount an FS with EXT3_FEATURE_INCOMPAT_RECOVER set. > ...ok, that will present "old" version of the filesystem to the > user... violating fsync() semantics. > > Still handy for recovering badly broken filesystems, I'd say. > > Pavel Cheers, Duane. -- "I never could learn to drink that blood and call it wine" - Bob Dylan ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: document ext3 requirements 2009-01-03 22:29 ` Pavel Machek 2009-01-03 23:01 ` Martin MOKREJŠ 2009-01-03 23:12 ` Duane Griffin @ 2009-01-06 10:06 ` Matthias Andree 2 siblings, 0 replies; 67+ messages in thread From: Matthias Andree @ 2009-01-06 10:06 UTC (permalink / raw) To: kernel list On Sat, 03 Jan 2009, Pavel Machek wrote: > On Sat 2009-01-03 22:17:15, Duane Griffin wrote: > > [Fixed top-posting] > > > > 2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>: > > > Pavel Machek wrote: > > >> readonly mount does actually write to the media in some cases. Document that. > > >> > > > Can one avoid replay of the journal then if it would be unclean? > > > Just curious. > > > > Nope. If the underlying block device is read-only then mounting the > > filesystem will fail. I tried to fix this some time ago, and have a > > set of patches that almost always work, but "almost always" isn't good > > enough. Unfortunately I never managed to figure out a way to finish it > > off without disgusting hacks or major surgery. > > Uhuh, can you just ignore the journal and mount it anyway? An ext3 file system that needs journal recovery sets one of the ext2 incompatible flags to prevent just that. > ...basically treating it like an ext2? > > ...ok, that will present "old" version of the filesystem to the > user... violating fsync() semantics. > > Still handy for recovering badly broken filesystems, I'd say. While you cannot have that, you'll need to dump the file system (possibly with dd_rescue) to another medium and work on the copy. That's what you should do anyways. ;-) I think if you really want to mount the file system without journal replay, you need to clear the needs-recovery "incompat" flag (on the copy, obviously). -- Matthias Andree ^ permalink raw reply [flat|nested] 67+ messages in thread
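The last suggestion above, clearing the needs-recovery flag on a copy, might look roughly like the following. This is only a sketch under assumptions: it takes the incompat feature word to sit at byte 96 of the superblock (itself 1024 bytes into the image), stored little-endian, with needs_recovery as bit 0x0004. `clear_needs_recovery` is a hypothetical helper working on in-memory bytes, not an existing tool, and whether a kernel would then mount the result is not verified here. The copy remains inconsistent because the journal was never replayed.

```python
import struct

SUPERBLOCK_OFFSET = 1024
INCOMPAT_FIELD = SUPERBLOCK_OFFSET + 96   # s_feature_incompat, 32-bit little-endian
FEATURE_INCOMPAT_RECOVER = 0x0004         # needs_recovery


def clear_needs_recovery(image):
    """Return a modified copy of the image bytes; the input is left untouched."""
    if len(image) < INCOMPAT_FIELD + 4:
        raise ValueError("image too short to hold a superblock")
    copy = bytearray(image)
    (incompat,) = struct.unpack_from("<I", copy, INCOMPAT_FIELD)
    struct.pack_into("<I", copy, INCOMPAT_FIELD,
                     incompat & ~FEATURE_INCOMPAT_RECOVER)
    return bytes(copy)


# Synthetic example: incompat = filetype (0x0002) | needs_recovery (0x0004).
original = bytearray(2048)
struct.pack_into("<I", original, INCOMPAT_FIELD, 0x0002 | 0x0004)
original = bytes(original)
patched = clear_needs_recovery(original)
```

Working on an immutable copy mirrors the advice above: the original medium is never written, so the journal is still available if a proper replay is wanted later.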
* Re: document ext3 requirements
2009-01-03 12:38 document ext3 requirements Pavel Machek
2009-01-03 21:17 ` Martin MOKREJŠ
@ 2009-01-04 2:32 ` Theodore Tso
2009-01-04 22:33 ` Pavel Machek
2009-01-04 22:34 ` [patch] document ext3 a bit better Pavel Machek
2009-01-04 13:35 ` document ext3 requirements Alexander E. Patrakov
2009-01-04 19:49 ` Rob Landley
3 siblings, 2 replies; 67+ messages in thread
From: Theodore Tso @ 2009-01-04 2:32 UTC (permalink / raw)
To: Pavel Machek; +Cc: kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc

On Sat, Jan 03, 2009 at 01:38:15PM +0100, Pavel Machek wrote:
> +Requirements
> +============
> +
> +Ext3 expects disk/storage subsystem to behave sanely. On sanely
> +behaving disk subsystem, data that have been successfully synced will
> +stay on the disk. Sane means:
> +
> +* writes to media never fail. Even if disk returns error condition during
> + write, ext3 can't handle that correctly, because success on fsync was already
> + returned when data hit the journal.
> +
> + (Fortunately writes failing are very uncommon on disks, as they
> + have spare sectors they use when write fails.)

This is not unique to ext3; per the discussion two weeks ago, this is
largely because of the fsync() interface not possibly being able to
return errors caused by failures when creating or modifying parent
directories. Given this, it's a bit misleading to place this in the
Documentation/filesystems/ext3.txt. At the minimum it should include
a discussion about what the issues might be, and given that pretty
much any Unix/Linux filesystem doesn't have a way of reflecting these
errors to application programs, it probably should be in a
filesystem-independent documentation file.

> +* either whole sector is correctly written or nothing is written during
> + powerfail.
> +
> + (Unfortunately, none of the cheap USB/SD flash cards I have seen do behave
> + like this, and are unsuitable for ext3. Because RAM tends to fail
> + faster than rest of system during powerfail, special hw killing
> + DMA transfers may be necessary. Not sure how common that problem
> + is on generic PC machines).

Again, this is true for other filesystems (it was first discovered on
SGI "pizza box" machines running XFS, and special hardware changes
were added to allow DMA aborts) --- in fact, because of ext3's use of
physical block journaling, it's much more likely that it will recover
from these sorts of errors. So it's very misleading to have this sort
of discussion in Documentation/filesystems/ext3.txt.

> +* either write caching is disabled, or hw can do barriers and they are enabled.
> +
> + (Note that barriers are disabled by default, use "barrier=1"
> + mount option after making sure hw can support them).

We really should get akpm to agree to accept the patch to enable
barriers by default instead. :-)

- Ted

^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: document ext3 requirements 2009-01-04 2:32 ` Theodore Tso @ 2009-01-04 22:33 ` Pavel Machek 2009-01-04 22:34 ` [patch] document ext3 a bit better Pavel Machek 1 sibling, 0 replies; 67+ messages in thread From: Pavel Machek @ 2009-01-04 22:33 UTC (permalink / raw) To: Theodore Tso, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc Hi! On Sat 2009-01-03 21:32:11, Theodore Tso wrote: > On Sat, Jan 03, 2009 at 01:38:15PM +0100, Pavel Machek wrote: > > +Requirements > > +============ > > + > > +Ext3 expects disk/storage subsystem to behave sanely. On sanely > > +behaving disk subsystem, data that have been successfully synced will > > +stay on the disk. Sane means: > > + > > +* writes to media never fail. Even if disk returns error condition during > > + write, ext3 can't handle that correctly, because success on fsync was already > > + returned when data hit the journal. > > + > > + (Fortunately writes failing are very uncommon on disks, as they > > + have spare sectors they use when write fails.) > > This is not unique to ext3; per the discussion two weeks ago, this is > largely because of the fsync() interface not possibly being able to Ok, so I guess I should split the patch to truly ext3-specific part, and the part that is common for all the filesystems. I guess I'll need some help with everything but ext2 and ext3... > return errors caused by failures when creating or modifying parent > directories. Given this, it's a bit misleading to place this in the > Documentation/filesystems/ext3.txt. At the minimum it should include > a discussion about what the issues might be, and given that pretty > much any Unix/Linux filesystem doesn't have a way of reflecting these > errors to application programs, it probably should be in a > filesystem-independent documentation file. Ok. I'll have to think about good name of that file. > > +* either write caching is disabled, or hw can do barriers and they are enabled. 
> > + > > + (Note that barriers are disabled by default, use "barrier=1" > > + mount option after making sure hw can support them). > > We really should get akpm to agree to accept the patch to default > barriers by default instead. :-) :-). Yes, that would help a bit. (No, it is not complete solution. barrier=0/writeback on should be still documented as unsafe). Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 67+ messages in thread
* [patch] document ext3 a bit better
2009-01-04 2:32 ` Theodore Tso
2009-01-04 22:33 ` Pavel Machek
@ 2009-01-04 22:34 ` Pavel Machek
2009-01-05 14:57 ` Theodore Tso
1 sibling, 1 reply; 67+ messages in thread
From: Pavel Machek @ 2009-01-04 22:34 UTC (permalink / raw)
To: Theodore Tso, kernel list, Andrew Morton, mtk.manpages, rdunlap,
linux-doc

ext3 has quite unexpected semantics for "ro" and defaults are usually
not what they are documented to be, due to mkfs override.

Signed-off-by: Pavel Machek <pavel@suse.cz>

diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index 9dd2a3b..113db1f 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -14,6 +14,11 @@ Options
When mounting an ext3 filesystem, the following option are accepted:
(*) == default

+ro Mount filesystem read only. Note that ext3 will replay
+ the journal (and thus write to the partition) even when
+ mounted "read only". "ro, noload" can be used to prevent
+ writes to the filesystem.
+
journal=update Update the ext3 file system's journal to the current
format.

@@ -27,7 +32,9 @@ journal_dev=devnum When the external jou
identified through its new major/minor numbers encoded
in devnum.

-noload Don't load the journal on mounting.
+noload Don't load the journal on mounting. Note that this forces
+ mount of inconsistent filesystem, which can lead to
+ various problems.

data=journal All data are committed into the journal prior to being
written into the main file system.
@@ -95,6 +102,8 @@ debug Extra debugging information is s
errors=remount-ro(*) Remount the filesystem read-only on an error.
errors=continue Keep going on a filesystem error.
errors=panic Panic and halt the machine if an error occurs.
+ (Note that default is overridden by superblock
+ setting on most systems).

data_err=ignore(*) Just print an error message if an error occurs
in a file data buffer in ordered mode.

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [patch] document ext3 a bit better
2009-01-04 22:34 ` [patch] document ext3 a bit better Pavel Machek
@ 2009-01-05 14:57 ` Theodore Tso
2009-01-06 9:21 ` Pavel Machek
0 siblings, 1 reply; 67+ messages in thread
From: Theodore Tso @ 2009-01-05 14:57 UTC (permalink / raw)
To: Pavel Machek; +Cc: kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc

On Sun, Jan 04, 2009 at 11:34:33PM +0100, Pavel Machek wrote:
> @@ -14,6 +14,11 @@ Options
> When mounting an ext3 filesystem, the following option are accepted:
> (*) == default
>
> +ro Mount filesystem read only. Note that ext3 will replay
> + the journal (and thus write to the partition) even when
> + mounted "read only". "ro, noload" can be used to prevent
> + writes to the filesystem.

I'd suggest "ro,noload" since the spaces screw up the mount options
parsing both on the command-line and in /etc/fstab. So how about:

Using the mount options "ro,noload" can be used....

> @@ -95,6 +102,8 @@ debug Extra debugging information is s
> errors=remount-ro(*) Remount the filesystem read-only on an error.
> errors=continue Keep going on a filesystem error.
> errors=panic Panic and halt the machine if an error occurs.
> + (Note that default is overridden by superblock
> + setting on most systems).

The default is always specified by the superblock setting. So users
will probably find it easier to understand if we remove the "(*)" and
add the explanatory comment:

(These mount options override the errors behavior
specified in the superblock, which can be configured
using tune2fs)

Pavel, thanks for working on improving the documentation; with these
fixes,

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

- Ted

^ permalink raw reply [flat|nested] 67+ messages in thread
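The two review points above (no spaces inside an option list, and an errors= mount option overriding whatever the superblock specifies) can be modelled in a few lines. This is an illustration only, not mount(8) or kernel code. `parse_mount_options` and `effective_errors` are hypothetical helpers; the codes 1, 2 and 3 for continue, remount-ro and panic are assumed from the ext2/ext3 on-disk `s_errors` field; rejecting whitespace outright is a choice made for this sketch.

```python
# Assumed on-disk s_errors codes (EXT2_ERRORS_CONTINUE, _RO, _PANIC).
SUPERBLOCK_ERRORS = {1: "continue", 2: "remount-ro", 3: "panic"}
VALID_ERRORS = set(SUPERBLOCK_ERRORS.values())


def parse_mount_options(option_string):
    """Split a comma-separated option string, rejecting embedded whitespace."""
    if any(ch.isspace() for ch in option_string):
        raise ValueError(
            "options must be comma-separated without spaces: %r" % option_string)
    return [opt for opt in option_string.split(",") if opt]


def effective_errors(s_errors, option_string=""):
    """Return the errors behavior after applying any errors= mount option."""
    if s_errors not in SUPERBLOCK_ERRORS:
        raise ValueError("unexpected s_errors value: %r" % s_errors)
    behavior = SUPERBLOCK_ERRORS[s_errors]   # the superblock supplies the default
    for opt in parse_mount_options(option_string):
        if opt.startswith("errors="):
            value = opt[len("errors="):]
            if value not in VALID_ERRORS:
                raise ValueError("unknown errors= value: %r" % value)
            behavior = value  # in this sketch a later errors= replaces an earlier one
    return behavior
```

With a superblock set to continue, `effective_errors(1, "ro,noload")` leaves the behavior at "continue", while `effective_errors(1, "ro,errors=panic")` gives "panic"; the string "ro, noload" is rejected because of the space.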
* Re: [patch] document ext3 a bit better 2009-01-05 14:57 ` Theodore Tso @ 2009-01-06 9:21 ` Pavel Machek 2009-01-09 23:24 ` Jiri Kosina 0 siblings, 1 reply; 67+ messages in thread From: Pavel Machek @ 2009-01-06 9:21 UTC (permalink / raw) To: Theodore Tso, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, Trivial patch monkey On Mon 2009-01-05 09:57:13, Theodore Tso wrote: > On Sun, Jan 04, 2009 at 11:34:33PM +0100, Pavel Machek wrote: > > @@ -14,6 +14,11 @@ Options > > When mounting an ext3 filesystem, the following option are accepted: > > (*) == default > > > > +ro Mount filesystem read only. Note that ext3 will replay > > + the journal (and thus write to the partition) even when > > + mounted "read only". "ro, noload" can be used to prevent > > + writes to the filesystem. > > I'd sugest "ro,noload" since the spaces screw up the mount options > parsing both on the command-line and in /etc/fstab. So how about: > > Using the mount options "ro,noload" can be used.... Too many "using", but yes, fixed, thanks. > > @@ -95,6 +102,8 @@ debug Extra debugging information is s > > errors=remount-ro(*) Remount the filesystem read-only on an error. > > errors=continue Keep going on a filesystem error. > > errors=panic Panic and halt the machine if an error occurs. > > + (Note that default is overriden by superblock > > + setting on most systems). > > The default is always specified by the superblock setting. So users > will probably find it easier to understand if we remove the "(*)" and > to add the explanatory comment: > > (These mount options override the errors behavior > specified in the superblock, which can be configured > using tune2fs) > > Pavel, thanks for working on improving the documentation; with these > fixes, Thanks! --- ext3 has quite unexpected semantics or "ro" and defaults are not what they are documented to be, due to mkfs override. 
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index 9dd2a3b..49c08bf 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -14,6 +14,11 @@ Options
 When mounting an ext3 filesystem, the following option are accepted:
 (*) == default
 
+ro			Mount filesystem read only. Note that ext3 will replay
+			the journal (and thus write to the partition) even when
+			mounted "read only". Mount options "ro,noload" can be
+			used to prevent writes to the filesystem.
+
 journal=update		Update the ext3 file system's journal to the current
 			format.
 
@@ -27,7 +32,9 @@ journal_dev=devnum When the external jou
 			identified through its new major/minor numbers encoded
 			in devnum.
 
-noload			Don't load the journal on mounting.
+noload			Don't load the journal on mounting. Note that this forces
+			mount of inconsistent filesystem, which can lead to
+			various problems.
 
 data=journal		All data are committed into the journal prior to being
 			written into the main file system.
@@ -92,9 +99,12 @@ nocheck
 
 debug			Extra debugging information is sent to syslog.
 
-errors=remount-ro(*)	Remount the filesystem read-only on an error.
+errors=remount-ro	Remount the filesystem read-only on an error.
 errors=continue		Keep going on a filesystem error.
 errors=panic		Panic and halt the machine if an error occurs.
+			(These mount options override the errors behavior
+			specified in the superblock, which can be
+			configured using tune2fs.)
 
 data_err=ignore(*)	Just print an error message if an error occurs
 			in a file data buffer in ordered mode.

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply related [flat|nested] 67+ messages in thread
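[Editorial sketch] The precedence described in the patch (an errors= mount option overrides whatever the superblock specifies) can be modelled in a few lines of Python. This is a toy model, not kernel code: the function name is hypothetical, and the choice that the last errors= option wins when several are given is an assumption made for the sketch.

```python
VALID_BEHAVIORS = ("continue", "remount-ro", "panic")


def effective_errors(superblock_setting, mount_options):
    """Return the errors behavior in effect for a mount.

    superblock_setting: behavior stored in the superblock (e.g. by tune2fs).
    mount_options: comma-separated option string as given to mount.
    Assumption of this sketch: the last errors= option wins.
    """
    result = superblock_setting
    for opt in mount_options.split(","):
        if opt.startswith("errors="):
            value = opt[len("errors="):]
            if value not in VALID_BEHAVIORS:
                raise ValueError("unknown errors behavior: " + value)
            result = value
    return result


# No errors= option: the superblock setting is what you get.
print(effective_errors("continue", "ro"))

# An explicit mount option overrides the superblock setting.
print(effective_errors("continue", "ro,errors=remount-ro"))
```

This is why marking any one of the three options with "(*)" in the documentation is misleading: the default depends on what mkfs or tune2fs stored in the superblock.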
* Re: [patch] document ext3 a bit better 2009-01-06 9:21 ` Pavel Machek @ 2009-01-09 23:24 ` Jiri Kosina 2009-01-09 23:36 ` Randy Dunlap 0 siblings, 1 reply; 67+ messages in thread From: Jiri Kosina @ 2009-01-09 23:24 UTC (permalink / raw) To: Pavel Machek, Randy Dunlap, Jonathan Corbet Cc: Theodore Tso, kernel list, Andrew Morton, mtk.manpages, linux-doc, Trivial patch monkey On Tue, 6 Jan 2009, Pavel Machek wrote: > On Mon 2009-01-05 09:57:13, Theodore Tso wrote: > > On Sun, Jan 04, 2009 at 11:34:33PM +0100, Pavel Machek wrote: > > > @@ -14,6 +14,11 @@ Options > > > When mounting an ext3 filesystem, the following option are accepted: > > > (*) == default > > > > > > +ro Mount filesystem read only. Note that ext3 will replay > > > + the journal (and thus write to the partition) even when > > > + mounted "read only". "ro, noload" can be used to prevent > > > + writes to the filesystem. > > > > I'd sugest "ro,noload" since the spaces screw up the mount options > > parsing both on the command-line and in /etc/fstab. So how about: > > > > Using the mount options "ro,noload" can be used.... > > Too many "using", but yes, fixed, thanks. > > > > @@ -95,6 +102,8 @@ debug Extra debugging information is s > > > errors=remount-ro(*) Remount the filesystem read-only on an error. > > > errors=continue Keep going on a filesystem error. > > > errors=panic Panic and halt the machine if an error occurs. > > > + (Note that default is overriden by superblock > > > + setting on most systems). > > > > The default is always specified by the superblock setting. So users > > will probably find it easier to understand if we remove the "(*)" and > > to add the explanatory comment: > > > > (These mount options override the errors behavior > > specified in the superblock, which can be configured > > using tune2fs) > > > > Pavel, thanks for working on improving the documentation; with these > > fixes, > > Thanks! 
> > --- > > ext3 has quite unexpected semantics or "ro" and defaults are > not what they are documented to be, due to mkfs override. > > Signed-off-by: Pavel Machek <pavel@suse.cz> > Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> > > diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt > index 9dd2a3b..49c08bf 100644 > --- a/Documentation/filesystems/ext3.txt > +++ b/Documentation/filesystems/ext3.txt > @@ -14,6 +14,11 @@ Options > When mounting an ext3 filesystem, the following option are accepted: > (*) == default > > +ro Mount filesystem read only. Note that ext3 will replay > + the journal (and thus write to the partition) even when > + mounted "read only". Mount options "ro,noload" can be > + used to prevent writes to the filesystem. > + > journal=update Update the ext3 file system's journal to the current > format. > > @@ -27,7 +32,9 @@ journal_dev=devnum When the external jou > identified through its new major/minor numbers encoded > in devnum. > > -noload Don't load the journal on mounting. > +noload Don't load the journal on mounting. Note that this forces > + mount of inconsistent filesystem, which can lead to > + various problems. > > data=journal All data are committed into the journal prior to being > written into the main file system. > @@ -92,9 +99,12 @@ nocheck > > debug Extra debugging information is sent to syslog. > > -errors=remount-ro(*) Remount the filesystem read-only on an error. > +errors=remount-ro Remount the filesystem read-only on an error. > errors=continue Keep going on a filesystem error. > errors=panic Panic and halt the machine if an error occurs. > + (These mount options override the errors behavior > + specified in the superblock, which can be > + configured using tune2fs.) > > data_err=ignore(*) Just print an error message if an error occurs > in a file data buffer in ordered mode. 
>

So, documentation guys, are you going to take this patch through the
Documentation tree (tytso already Signed off on that), or should I take it
through trivial tree?

Thanks,

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] document ext3 a bit better 2009-01-09 23:24 ` Jiri Kosina @ 2009-01-09 23:36 ` Randy Dunlap 2009-01-09 23:47 ` Jiri Kosina 0 siblings, 1 reply; 67+ messages in thread From: Randy Dunlap @ 2009-01-09 23:36 UTC (permalink / raw) To: Jiri Kosina Cc: Pavel Machek, Jonathan Corbet, Theodore Tso, kernel list, Andrew Morton, mtk.manpages, linux-doc, Trivial patch monkey Jiri Kosina wrote: > On Tue, 6 Jan 2009, Pavel Machek wrote: > >> On Mon 2009-01-05 09:57:13, Theodore Tso wrote: >>> On Sun, Jan 04, 2009 at 11:34:33PM +0100, Pavel Machek wrote: >>>> @@ -14,6 +14,11 @@ Options >>>> When mounting an ext3 filesystem, the following option are accepted: >>>> (*) == default >>>> >>>> +ro Mount filesystem read only. Note that ext3 will replay >>>> + the journal (and thus write to the partition) even when >>>> + mounted "read only". "ro, noload" can be used to prevent >>>> + writes to the filesystem. >>> I'd sugest "ro,noload" since the spaces screw up the mount options >>> parsing both on the command-line and in /etc/fstab. So how about: >>> >>> Using the mount options "ro,noload" can be used.... >> Too many "using", but yes, fixed, thanks. >> >>>> @@ -95,6 +102,8 @@ debug Extra debugging information is s >>>> errors=remount-ro(*) Remount the filesystem read-only on an error. >>>> errors=continue Keep going on a filesystem error. >>>> errors=panic Panic and halt the machine if an error occurs. >>>> + (Note that default is overriden by superblock >>>> + setting on most systems). >>> The default is always specified by the superblock setting. So users >>> will probably find it easier to understand if we remove the "(*)" and >>> to add the explanatory comment: >>> >>> (These mount options override the errors behavior >>> specified in the superblock, which can be configured >>> using tune2fs) >>> >>> Pavel, thanks for working on improving the documentation; with these >>> fixes, >> Thanks! 
>> >> --- >> >> ext3 has quite unexpected semantics or "ro" and defaults are >> not what they are documented to be, due to mkfs override. >> >> Signed-off-by: Pavel Machek <pavel@suse.cz> >> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> >> >> diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt >> index 9dd2a3b..49c08bf 100644 >> --- a/Documentation/filesystems/ext3.txt >> +++ b/Documentation/filesystems/ext3.txt >> @@ -14,6 +14,11 @@ Options >> When mounting an ext3 filesystem, the following option are accepted: >> (*) == default >> >> +ro Mount filesystem read only. Note that ext3 will replay >> + the journal (and thus write to the partition) even when >> + mounted "read only". Mount options "ro,noload" can be >> + used to prevent writes to the filesystem. >> + >> journal=update Update the ext3 file system's journal to the current >> format. >> >> @@ -27,7 +32,9 @@ journal_dev=devnum When the external jou >> identified through its new major/minor numbers encoded >> in devnum. >> >> -noload Don't load the journal on mounting. >> +noload Don't load the journal on mounting. Note that this forces >> + mount of inconsistent filesystem, which can lead to >> + various problems. >> >> data=journal All data are committed into the journal prior to being >> written into the main file system. >> @@ -92,9 +99,12 @@ nocheck >> >> debug Extra debugging information is sent to syslog. >> >> -errors=remount-ro(*) Remount the filesystem read-only on an error. >> +errors=remount-ro Remount the filesystem read-only on an error. >> errors=continue Keep going on a filesystem error. >> errors=panic Panic and halt the machine if an error occurs. >> + (These mount options override the errors behavior >> + specified in the superblock, which can be >> + configured using tune2fs.) >> >> data_err=ignore(*) Just print an error message if an error occurs >> in a file data buffer in ordered mode. 
>>
>
> So, documentation guys, are you going to take this patch through the
> Documentation tree (tytso already Signed off on that), or should I take it

(probably should be Acked-by or Reviewed-by if he isn't merging it)

> through trivial tree?

I'm so far behind on doc patches that I haven't read any of this thread
yet, so you can merge it IMO.

Thanks,
-- 
~Randy

^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] document ext3 a bit better
2009-01-09 23:36 ` Randy Dunlap
@ 2009-01-09 23:47 ` Jiri Kosina
0 siblings, 0 replies; 67+ messages in thread
From: Jiri Kosina @ 2009-01-09 23:47 UTC (permalink / raw)
To: Randy Dunlap
Cc: Pavel Machek, Jonathan Corbet, Theodore Tso, kernel list,
Andrew Morton, mtk.manpages, linux-doc, Trivial patch monkey

On Fri, 9 Jan 2009, Randy Dunlap wrote:

> I'm so far behind on doc patches that I haven't read any of this thread
> yet, so you can merge it IMO.

OK, I have applied it, thanks Pavel.

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: document ext3 requirements
2009-01-03 12:38 document ext3 requirements Pavel Machek
2009-01-03 21:17 ` Martin MOKREJŠ
2009-01-04 2:32 ` Theodore Tso
@ 2009-01-04 13:35 ` Alexander E. Patrakov
2009-01-04 13:53 ` Valdis.Kletnieks
` (3 more replies)
2009-01-04 19:49 ` Rob Landley
3 siblings, 4 replies; 67+ messages in thread
From: Alexander E. Patrakov @ 2009-01-04 13:35 UTC (permalink / raw)
To: Pavel Machek
Cc: kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap,
linux-doc, Alan Cox

Pavel Machek wrote:
[CC: Alan Cox because of his reply in the "XFS internal error" thread]

> Using ext3 is only safe if storage subsystem meets certain
> criteria. Document those.

Thanks for this patch. However, after reading this, I have a stupid
question: which file system should I use if I had to reinstall my
computers from scratch now?

Ext3 means either hardware that supports barriers (not sure how to
check, and anyway I have to use encryption on the work laptop due to the
corporate policy) or disabling write cache (but, as Alan Cox said, this
shortens the lifespan of the disk). Does this requirement apply to other
journaling filesystems? Do I need journaling at all, given that I have
an UPS on my desktop and a battery in the laptop?

-- 
Alexander E. Patrakov

^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: document ext3 requirements
2009-01-04 13:35 ` document ext3 requirements Alexander E. Patrakov
@ 2009-01-04 13:53 ` Valdis.Kletnieks
2009-01-04 18:21 ` Michael Tokarev
` (2 subsequent siblings)
3 siblings, 0 replies; 67+ messages in thread
From: Valdis.Kletnieks @ 2009-01-04 13:53 UTC (permalink / raw)
To: Alexander E. Patrakov
Cc: Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages,
rdunlap, linux-doc, Alan Cox

[-- Attachment #1: Type: text/plain, Size: 1097 bytes --]

On Sun, 04 Jan 2009 18:35:41 +0500, "Alexander E. Patrakov" said:

> Ext3 means either hardware that supports barriers (not sure how to
> check, and anyway I have to use encryption on the work laptop due to the
> corporate policy) or disabling write cache (but, as Alan Cox said, this
> shortens the lifespan of the disk).

False dichotomy. This isn't an "either/or", as there's a *third* case:
"understand the issues and risks involved if you have a write cache and
no barrier support, and learn to deal with it".

As you point out, if it's a laptop with a battery, the risk may be *very*
low. Let's say there's a 1 in 10,000 chance that you'll trash a file
system and need to restore from backups. That may be totally acceptable
if you've already estimated a 1 in 500 chance of the whole damned laptop
going walkies while you're not looking, and then you *still* need to be
able to restore from backups onto a replacement machine.

Yes, for some systems, the whole barriers/write cache thing is in fact
very important. But for others, data loss due to spilled coffee is a
bigger worry...

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply [flat|nested] 67+ messages in thread
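[Editorial sketch] The comparison of risks above can be made concrete with a little arithmetic. The two probabilities are the illustrative figures from the message (1 in 10,000 and 1 in 500); treating the two events as independent and as covering the same time period is an assumption of this sketch, not something the message states.

```python
from fractions import Fraction

# Illustrative figures from the message above.
p_fs_trashed = Fraction(1, 10_000)   # filesystem corrupted, restore needed
p_laptop_lost = Fraction(1, 500)     # whole laptop gone, restore needed

# Probability of needing the backups for either reason,
# assuming (for this sketch only) that the events are independent.
p_restore = 1 - (1 - p_fs_trashed) * (1 - p_laptop_lost)

# How much more likely the loss of the laptop is than the corruption.
ratio = p_laptop_lost / p_fs_trashed

print(p_restore, float(p_restore))
print(ratio)
```

Under these assumptions, losing the machine is 20 times more likely than the filesystem corruption, and the corruption risk moves the overall chance of needing a restore only from 0.2% to about 0.21%.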
* Re: document ext3 requirements
2009-01-04 13:35 ` document ext3 requirements Alexander E. Patrakov
2009-01-04 13:53 ` Valdis.Kletnieks
@ 2009-01-04 18:21 ` Michael Tokarev
2009-01-04 18:38 ` Theodore Tso
2009-01-04 20:10 ` Pavel Machek
3 siblings, 0 replies; 67+ messages in thread
From: Michael Tokarev @ 2009-01-04 18:21 UTC (permalink / raw)
To: Alexander E. Patrakov
Cc: Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages,
rdunlap, linux-doc, Alan Cox

Alexander E. Patrakov wrote:
[]
> Ext3 means either hardware that supports barriers (not sure how to
> check, and anyway I have to use encryption on the work laptop due to the
> corporate policy) or disabling write cache (but, as Alan Cox said, this
> shortens the lifespan of the disk). Does this requirement apply to other
> journaling filesystems? Do I need journaling at all, given that I have
> an UPS on my desktop and a battery in the laptop?

There's another possibility too, somewhat more risky. Namely,
run with write cache ON by default, and switch it off when
running off battery (either UPS or notebook). Should give the best
of both worlds, PROVIDED the battery/UPS actually works :)

/mjt

^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: document ext3 requirements
2009-01-04 13:35 ` document ext3 requirements Alexander E. Patrakov
2009-01-04 13:53 ` Valdis.Kletnieks
2009-01-04 18:21 ` Michael Tokarev
@ 2009-01-04 18:38 ` Theodore Tso
2009-01-04 22:37 ` Pavel Machek
2009-01-05 11:43 ` Alan Cox
2009-01-04 20:10 ` Pavel Machek
3 siblings, 2 replies; 67+ messages in thread
From: Theodore Tso @ 2009-01-04 18:38 UTC (permalink / raw)
To: Alexander E. Patrakov
Cc: Pavel Machek, kernel list, Andrew Morton, mtk.manpages, rdunlap,
linux-doc, Alan Cox

On Sun, Jan 04, 2009 at 06:35:41PM +0500, Alexander E. Patrakov wrote:
>
> Ext3 means either hardware that supports barriers (not sure how to
> check

Pretty much all modern disk drives support barriers. And note that
w/o barriers ext3 has worked pretty well. *If* you have a workload that
pushes your system into a mode where it is very low on memory,
so it is constantly paging/thrashing and you have a workload which is
metadata intensive, and you crash the machine while it is thrashing,
it is possible to end up in a situation where your filesystem is
corrupted and you have to use e2fsck to correct the filesystem. In
practice this is often not the case, which is why the default for ext3
has been with barriers disabled, and most people have not noted major
problems. This is why Andrew Morton has refused to accept the patch for
ext3 which enables barriers by default; he's not convinced the
performance hit is worth the improvement in reliability.

Ext4 does enable barriers by default, mainly because filesystem
developers tend to believe the reliability is more important than
performance. (On the other hand, Google runs with ext2 w/o
journalling, because everything is replicated three times and it's
easier to just blow away the filesystem and resync from one of the
duplicate copies; so in the right circumstances, maybe worrying only
about performance and ignoring reliability makes perfect sense.)

> and anyway I have to use encryption on the work laptop due to the
> corporate policy

If dm supported barriers, this wouldn't be an issue. Personally, I
find the convenience of LVM is so useful that I use ext4 with LVM,
even though the barrier requests get dropped on the ground. And I'm a
kernel developer, and I use a laptop with suspend/resume, which means
I often crash uncleanly --- and I've not lost data yet, despite the
lack of barriers. (On the other hand, my laptop has 4 gigs of memory,
so I'm rarely thrashing due to memory pressure.)

> or disabling write cache (but, as Alan Cox said, this
> shortens the lifespan of the disk).

Huh? I've never heard an assertion that disabling the write cache (I
assume you mean using write-through caching as opposed to write-back
caching), shortens the lifespan of disk drives. Aggressive battery
saving mode is far more likely to shorten disk drive life, due to
spinning the platters up and down a lot.

> Does this requirement apply to other
> journaling filesystems? Do I need journaling at all, given that I have
> an UPS on my desktop and a battery in the laptop?

Which requirement? Barriers? Most journaling filesystems simply
enable barriers by default.

And journalling is useful so that if your system crashes, say due to
suspend and resume not working out, or the battery runs dry without
your noticing it, you can avoid running fsck at boot time. It's
really more about shortening the boot time after a crash more than
anything else.

- Ted

^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: document ext3 requirements
2009-01-04 18:38 ` Theodore Tso
@ 2009-01-04 22:37 ` Pavel Machek
2009-01-04 23:58 ` Theodore Tso
2009-01-05 11:43 ` Alan Cox
1 sibling, 1 reply; 67+ messages in thread
From: Pavel Machek @ 2009-01-04 22:37 UTC (permalink / raw)
To: Theodore Tso, Alexander E. Patrakov, kernel list, Andrew Morton,
mtk.manpages, rdunlap, linux-doc, Alan Cox

On Sun 2009-01-04 13:38:34, Theodore Tso wrote:
> On Sun, Jan 04, 2009 at 06:35:41PM +0500, Alexander E. Patrakov wrote:
> >
> > Ext3 means either hardware that supports barriers (not sure how to
> > check
>
> Pretty much all modern disk drives supports barriers. And note that
> w/o barriers ext3 has worked pretty well. *If* you have a workload
> pushes your system into a mode which where it is very low on memory,
> so it is constantly paging/thrashing and you have a workload which is
> metadata intensive, and you crash the machine while it is thrashing,
> it is possible to end up in a situation where your filesystem is
> corrupted and you have to use e2fsck to correct the filesystem. In

Are you sure you need to have thrashing? AFAICT metadata + fsync heavy
workload should be enough... and there were scripts to easily repeat
that.

> > Does this requirement apply to other
> > journaling filesystems? Do I need journaling at all, given that I have
> > an UPS on my desktop and a battery in the laptop?
>
> Which requirement? Barriers? Most journaling filesystems simply
> enable barriers by default.
>
> And journalling is useful so that if your system crashes, say due to
> suspend and resume not working out, or the battery runs dry without
> your noticing it, you can avoid running fsck at boot time. It's
> really more about shorting the boot time after a crash more than
> anything else.

Actually, journalling with barriers=0 should still be "safe" in case
of kernel crashes (*), right? Because if just kernel is dead, disk
firmware will still write the cache back, AFAICT.

Pavel

(*) kernel crashes that do not involve writing random garbage to disk.
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: document ext3 requirements
2009-01-04 22:37 ` Pavel Machek
@ 2009-01-04 23:58 ` Theodore Tso
0 siblings, 0 replies; 67+ messages in thread
From: Theodore Tso @ 2009-01-04 23:58 UTC (permalink / raw)
To: Pavel Machek
Cc: Alexander E. Patrakov, kernel list, Andrew Morton, mtk.manpages,
rdunlap, linux-doc, Alan Cox

On Sun, Jan 04, 2009 at 11:37:56PM +0100, Pavel Machek wrote:
>
> Are you sure you need to have thrashing? AFAICT metadata + fsync heavy
> workload should be enough... and there were scripts to easily repeat
> that.

The memory pressure is needed to force disk buffers out to disk sooner
than fsync() would normally force buffers out. The scripts which I've
seen induced memory pressure. If the disk is *super* aggressive at
reordering writes, I suppose a heavy fsync workload might be enough on
its own, but in practice, it's generally not enough.

- Ted

^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: document ext3 requirements
2009-01-04 18:38 ` Theodore Tso
2009-01-04 22:37 ` Pavel Machek
@ 2009-01-05 11:43 ` Alan Cox
2009-01-07 11:59 ` Rob Landley
1 sibling, 1 reply; 67+ messages in thread
From: Alan Cox @ 2009-01-05 11:43 UTC (permalink / raw)
To: Theodore Tso
Cc: Alexander E. Patrakov, Pavel Machek, kernel list, Andrew Morton,
mtk.manpages, rdunlap, linux-doc

> If dm supported barriers, this wouldn't be an issue. Personally, I

"If the dm people applied the patches to support barriers" I believe is
the correct description - Andi ?

dm and md want fixing and even in the md case it isn't hard to do right.

> > or disabling write cache (but, as Alan Cox said, this
> > shortens the lifespan of the disk).
>
> Huh? I've never heard an assertion that disabling the write cache (I
> assume you mean using write-through caching as opposed to write-back
> caching), shortens the lifespan of disk drives. Aggressive battery

That's what I was told by a disk vendor - simply because the drive makes a
lot more mechanical movements and writes.

> your noticing it, you can avoid running fsck at boot time. It's
> really more about shorting the boot time after a crash more than
> anything else.

That depends enormously on your environment. In a secure environment full
data journalling is practically essential to avoid the tiny risk of bits
of important data turning up in another user's file.

Alan

^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: document ext3 requirements
2009-01-05 11:43 ` Alan Cox
@ 2009-01-07 11:59 ` Rob Landley
0 siblings, 0 replies; 67+ messages in thread
From: Rob Landley @ 2009-01-07 11:59 UTC (permalink / raw)
To: Alan Cox
Cc: Theodore Tso, Alexander E. Patrakov, Pavel Machek, kernel list,
Andrew Morton, mtk.manpages, rdunlap, linux-doc

On Monday 05 January 2009 05:43:29 Alan Cox wrote:
> > Huh? I've never heard an assertion that disabling the write cache (I
> > assume you mean using write-through caching as opposed to write-back
> > caching), shortens the lifespan of disk drives. Aggressive battery
>
> Thats what I was told by a disk vendor - simply because the drive makes a
> lot more mechanical movements and writes.

It certainly sounds like less write caching would shorten the lifespan of
flash devices...

Rob

^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: document ext3 requirements
2009-01-04 13:35 ` document ext3 requirements Alexander E. Patrakov
` (2 preceding siblings ...)
2009-01-04 18:38 ` Theodore Tso
@ 2009-01-04 20:10 ` Pavel Machek
3 siblings, 0 replies; 67+ messages in thread
From: Pavel Machek @ 2009-01-04 20:10 UTC (permalink / raw)
To: Alexander E. Patrakov
Cc: kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap,
linux-doc, Alan Cox

On Sun 2009-01-04 18:35:41, Alexander E. Patrakov wrote:
> Pavel Machek wrote:
> [CC: Alan Cox because of his reply in the "XFS internal error" thread]
>
>> Using ext3 is only safe if storage subsystem meets certain
>> criteria. Document those.
>
> Thanks for this patch. However, after reading this, I have a stupid
> question: which file system should I use if I had to reinstall my
> computers from scratch now?

ext2 is still the safest default... if you can live with fsck. ext3 is
the safest from the journalling ones, AFAICT.

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: document ext3 requirements
2009-01-03 12:38 document ext3 requirements Pavel Machek
` (2 preceding siblings ...)
2009-01-04 13:35 ` document ext3 requirements Alexander E. Patrakov
@ 2009-01-04 19:49 ` Rob Landley
2009-01-04 22:06 ` Theodore Tso
2009-01-04 22:55 ` Pavel Machek
3 siblings, 2 replies; 67+ messages in thread
From: Rob Landley @ 2009-01-04 19:49 UTC (permalink / raw)
To: Pavel Machek
Cc: kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap, linux-doc

On Saturday 03 January 2009 06:38:15 Pavel Machek wrote:
> +Ext3 expects disk/storage subsystem to behave sanely. On sanely
> +behaving disk subsystem, data that have been successfully synced will
> +stay on the disk. Sane means:
> +
> +* writes to media never fail. Even if disk returns error condition during
> + write, ext3 can't handle that correctly, because success on fsync was already
> + returned when data hit the journal.
> +
> + (Fortunately writes failing are very uncommon on disks, as they
> + have spare sectors they use when write fails.)
> +
> +* either whole sector is correctly written or nothing is written during
> + powerfail.
> +
> + (Unfortuantely, none of the cheap USB/SD flash cards I seen do behave
> + like this, and are unsuitable for ext3.

Want to document the granularity issues with flash, while you're at it?

An inherent problem with using flash as a normal block device is that the
flash erase size is bigger than most filesystem sector sizes. So when you
request a write, it may erase and rewrite the next 64k, 128k, or even a couple
megabytes on the really _big_ ones.

If you lose power in the middle of that, ext3 won't notice that data in the
"sectors" _after_ the one you were trying to write to got trashed.

The flash filesystems take this into account as part of their wear levelling
stuff (they normally copy the entire chunk into a new chunk, leaving the old
one in place until it's no longer needed), but they need to query the device
to get the erase granularity in order to do that, which is why they don't work
on non-flash block devices.

Rob

^ permalink raw reply [flat|nested] 67+ messages in thread
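[Editorial sketch] The granularity problem described above can be sketched in Python: a write to a single 512-byte sector puts every other sector in the same erase block at risk if power fails mid-rewrite. The 128k erase block size is just one of the sizes mentioned in the message, and the assumption that erase blocks are aligned and that the whole block containing the target sector is rewritten is a simplification made for this sketch; real devices differ.

```python
SECTOR_SIZE = 512            # bytes, as seen by the filesystem
ERASE_BLOCK = 128 * 1024     # bytes; assumed for illustration


def sectors_at_risk(target_sector,
                    sector_size=SECTOR_SIZE,
                    erase_block=ERASE_BLOCK):
    """Return the range of sectors sharing an erase block with target_sector.

    Assumes aligned erase blocks that are erased and rewritten as a
    whole (a simplification of what real flash translation layers do).
    """
    per_block = erase_block // sector_size
    first = (target_sector // per_block) * per_block
    return range(first, first + per_block)


# The filesystem asks for one sector to be written...
at_risk = sectors_at_risk(1000)

# ...but the device touches every sector in the enclosing erase block.
print(len(at_risk), at_risk[0], at_risk[-1])
```

The filesystem journals only the sector it asked to write, so damage to the neighbouring sectors in that range goes unnoticed after a power failure.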
* Re: document ext3 requirements
2009-01-04 19:49 ` Rob Landley
@ 2009-01-04 22:06 ` Theodore Tso
2009-01-04 22:25 ` Pavel Machek
` (3 more replies)
2009-01-04 22:55 ` Pavel Machek
1 sibling, 4 replies; 67+ messages in thread
From: Theodore Tso @ 2009-01-04 22:06 UTC (permalink / raw)
To: Rob Landley
Cc: Pavel Machek, kernel list, Andrew Morton, mtk.manpages, rdunlap,
linux-doc

On Sun, Jan 04, 2009 at 01:49:49PM -0600, Rob Landley wrote:
>
> Want to document the granularity issues with flash, while you're at it?
>
> An inherent problem with using flash as a normal block device is that the
> flash erase size is bigger than most filesystem sector sizes. So when you
> request a write, it may erase and rewrite the next 64k, 128k, or even a couple
> megabytes on the really _big_ ones.
>
> If you lose power in the middle of that, ext3 won't notice that data in the
> "sectors" _after_ the one your were trying to write to got trashed.

True enough, although the newer SSD's will have this problem addressed
(although at least initially, they are **far** more costly than the
el-cheapo 32GB SD cards you can find at the checkout counter at Fry's
alongside battery-powered shavers and trashy ipod speakers).

I will stress again, that most of this doesn't belong in
Documentation/filesystems/ext3.txt, as most of this is *not*
ext3-specific.

- Ted

^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: document ext3 requirements
2009-01-04 22:06 ` Theodore Tso
@ 2009-01-04 22:25 ` Pavel Machek
2009-01-04 23:00 ` [patch] " Pavel Machek
` (2 subsequent siblings)
3 siblings, 0 replies; 67+ messages in thread
From: Pavel Machek @ 2009-01-04 22:25 UTC (permalink / raw)
To: Theodore Tso, Rob Landley, kernel list, Andrew Morton,
mtk.manpages, rdunlap, linux-doc

On Sun 2009-01-04 17:06:34, Theodore Tso wrote:
> On Sun, Jan 04, 2009 at 01:49:49PM -0600, Rob Landley wrote:
> >
> > Want to document the granularity issues with flash, while you're at it?
> >
> > An inherent problem with using flash as a normal block device is that the
> > flash erase size is bigger than most filesystem sector sizes. So when you
> > request a write, it may erase and rewrite the next 64k, 128k, or even a couple
> > megabytes on the really _big_ ones.
> >
> > If you lose power in the middle of that, ext3 won't notice that data in the
> > "sectors" _after_ the one your were trying to write to got trashed.
>
> True enough, although the newer SSD's will have this problem addressed
> (although at least initially, they are **far** more costly than the
> el-cheapo 32GB SD cards you can find at the checkout counter at Fry's
> alongside battery-powered shavers and trashy ipod speakers).
>
> I will stress again, that most of this doesn't belong in
> Documentation/filesystems/ext3.txt, as most of this is *not*
> ext3-specific.

I've initially done the patch for ext3 because that's what I'm using
and because I felt responsible for documenting it after a huge thread.

At least barrier=1 seems to be ext3 specific, and perhaps logfs or
something can survive full eraseblocks disappearing.

Anyway, I guess we all agree that this needs to be documented
_somewhere_, and that's what I'm trying to do.

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply [flat|nested] 67+ messages in thread
* [patch] Re: document ext3 requirements
2009-01-04 22:06 ` Theodore Tso
2009-01-04 22:25 ` Pavel Machek
@ 2009-01-04 23:00 ` Pavel Machek
2009-01-05 2:42 ` Rob Landley
2009-01-04 23:07 ` Pavel Machek
2009-01-05 1:38 ` Rob Landley
3 siblings, 1 reply; 67+ messages in thread
From: Pavel Machek @ 2009-01-04 23:00 UTC (permalink / raw)
To: Theodore Tso, Rob Landley, kernel list, Andrew Morton,
mtk.manpages, rdunlap, linux-doc

On Sun 2009-01-04 17:06:34, Theodore Tso wrote:
> On Sun, Jan 04, 2009 at 01:49:49PM -0600, Rob Landley wrote:
> >
> > Want to document the granularity issues with flash, while you're at it?
> >
> > An inherent problem with using flash as a normal block device is that the
> > flash erase size is bigger than most filesystem sector sizes. So when you
> > request a write, it may erase and rewrite the next 64k, 128k, or even a couple
> > megabytes on the really _big_ ones.
> >
> > If you lose power in the middle of that, ext3 won't notice that data in the
> > "sectors" _after_ the one your were trying to write to got trashed.
>
> True enough, although the newer SSD's will have this problem addressed
> (although at least initially, they are **far** more costly than the
> el-cheapo 32GB SD cards you can find at the checkout counter at Fry's
> alongside battery-powered shavers and trashy ipod speakers).
>
> I will stress again, that most of this doesn't belong in
> Documentation/filesystems/ext3.txt, as most of this is *not*
> ext3-specific.

Agreed... So what about this one?

---

Document linux filesystem expectations. Ext3 can't handle write errors
of any kind, and can't handle non-atomic sector writes. Other
filesystems are probably even worse...

Signed-off-by: Pavel Machek <pavel@suse.cz> diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt new file mode 100644 index 0000000..7817a9c --- /dev/null +++ b/Documentation/filesystems/expectations.txt @@ -0,0 +1,44 @@ +Linux filesystems can only work correctly when several conditions are +met in the block layer and below (disks, flash cards). Some of them +are obvious ("data on media should not change randomly"), some are +less so. + +Write errors not allowed (NO-WRITE-ERRORS) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Writes to media never fail. Even if disk returns error condition +during write, filesystems can't handle that correctly, because success +on fsync was already returned when data hit the journal. + + Fortunately writes failing are very uncommon on traditional + spinning disks, as they have spare sectors they use when write + fails. + +Sector writes are atomic (ATOMIC-SECTORS) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Either whole sector is correctly written or nothing is written during +powerfail. + + Unfortuantely, none of the cheap USB/SD flash cards I seen do + behave like this, and are unsuitable for all linux filesystems + I know. + + An inherent problem with using flash as a normal block + device is that the flash erase size is bigger than + most filesystem sector sizes. So when you request a + write, it may erase and rewrite the next 64k, 128k, or + even a couple megabytes on the really _big_ ones. + + If you lose power in the middle of that, filesystem + won't notice that data in the "sectors" _after_ the + one your were trying to write to got trashed. + + Because RAM tends to fail faster than rest of system during + powerfail, special hw killing DMA transfers may be neccessary; + otherwise, disks may write garbage during powerfail. + Not sure how common that problem is on generic PC machines. 
+ + + + diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt index 9dd2a3b..8cb64b0 100644 --- a/Documentation/filesystems/ext3.txt +++ b/Documentation/filesystems/ext3.txt @@ -188,6 +197,25 @@ mke2fs: create a ext3 partition with th debugfs: ext2 and ext3 file system debugger. ext2online: online (mounted) ext2 and ext3 filesystem resizer +Requirements +============ + +Ext3 expects disk/storage subsystem to behave sanely. On sanely +behaving disk subsystem, data that have been successfully synced will +stay on the disk. Sane means: + +* write errors not allowed + +* sector writes are atomic + +(see expectations.txt; note that most/all linux filesystems have similar +expectations) + +* either write caching is disabled, or hw can do barriers and they are enabled. + + (Note that barriers are disabled by default, use "barrier=1" + mount option after making sure hw can support them). + References ========== -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [patch] Re: document ext3 requirements
2009-01-04 23:00 ` [patch] " Pavel Machek
@ 2009-01-05 2:42 ` Rob Landley
2009-01-05 9:54 ` Pavel Machek
0 siblings, 1 reply; 67+ messages in thread
From: Rob Landley @ 2009-01-05 2:42 UTC (permalink / raw)
To: Pavel Machek
Cc: Theodore Tso, kernel list, Andrew Morton, mtk.manpages, rdunlap,
linux-doc
On Sunday 04 January 2009 17:00:53 Pavel Machek wrote:
> Document linux filesystem expectations. Ext3 can't handle write errors
> of any kind, and can't handle non-atomic sector writes. Other
> filesystems are probably even worse...
These concerns look like they're specifically for block backed filesystems,
which is one of four different types. I wrote a longish incoherent rant to
the busybox list about the different types of filesystems a couple months
back, in the context of a thread about implementing the "mount" command.
Dunno how relevant it is:
http://lists.busybox.net/pipermail/busybox/2008-November/067970.html
There are a couple fun relevant corner cases, such as the fact that nfs is
the only filesystem I'm aware of where the return value of close() can
actually mean something. (Due to the caching, you tend to get errors
reported _there_. I don't remember why, if I ever knew.)
> Signed-off-by: Pavel Machek <pavel@suse.cz>
>
> diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
> new file mode 100644
> index 0000000..7817a9c
> --- /dev/null
> +++ b/Documentation/filesystems/expectations.txt
> @@ -0,0 +1,44 @@
> +Linux filesystems can only work correctly when several conditions are
> +met in the block layer and below (disks, flash cards). Some of them
> +are obvious ("data on media should not change randomly"), some are
> +less so.
> +
> +Write errors not allowed (NO-WRITE-ERRORS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Writes to media never fail. Even if disk returns error condition
> +during write, filesystems can't handle that correctly, because success
> +on fsync was already returned when data hit the journal.
> +
> + Fortunately writes failing are very uncommon on traditional
> + spinning disks, as they have spare sectors they use when write
> + fails.
The failures show up in dmesg(), and some filesystems will remount
themselves read only if the physical media driver manages to propagate an
error back to the filesystem. (Note that the scsi subsystem has
historically had so many glue layers that it couldn't manage to do this;
that's been improved over the years but whether or not it actually _works_
now, I couldn't tell you.)
Some kind of system monitor could notice the dmesg entries, but the actual
write goes into the cache and the physical media error normally happens
long after the program that did the write returned from its write call,
often after it closed the file, and sometimes after it exited.
Even sync() and fsync() won't help you there because if multiple processes
do that, only the _first_ one will get the physical media error. (The
filesystem doesn't associate physical media errors with processes; there's
too many layers in between and it's not necessarily a 1:1 relationship
anyway.)
> +Sector writes are atomic (ATOMIC-SECTORS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Either whole sector is correctly written or nothing is written during
> +powerfail.
> +
> + Unfortunately, none of the cheap USB/SD flash cards I have seen do
> + behave like this, and are unsuitable for all linux filesystems
> + I know.
My impression is you might as well leave the suckers vfat. It's a stupid
little filesystem but its very stupidity makes it as resistant to damage as
anything else (which admittedly isn't much), and it's had such a history of
_taking_ damage that the tools to cope with damage to it are actually
pretty good.
That said, constant updates to the first few sectors will burn out your USB
flash disk if you use it as something other than a backup media. That's
true with a lot of filesystems. (Hardware wear levelling isn't very good,
it cycles between the same dozen or so physical sectors for each logical
sector.)
In general, those game consoles that say "please don't power off the thing
while we're writing to flash" have a reason for the message. :)
> + An inherent problem with using flash as a normal block
> + device is that the flash erase size is bigger than
> + most filesystem sector sizes. So when you request a
> + write, it may erase and rewrite the next 64k, 128k, or
> + even a couple megabytes on the really _big_ ones.
> +
> + If you lose power in the middle of that, filesystem
> + won't notice that data in the "sectors" _after_ the
> + one you were trying to write to got trashed.
> +
> + Because RAM tends to fail faster than rest of system during
> + powerfail, special hw killing DMA transfers may be necessary;
> + otherwise, disks may write garbage during powerfail.
> + Not sure how common that problem is on generic PC machines.
> +
> +
> +
> +
> diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
> index 9dd2a3b..8cb64b0 100644
> --- a/Documentation/filesystems/ext3.txt
> +++ b/Documentation/filesystems/ext3.txt
> @@ -188,6 +197,25 @@ mke2fs: create a ext3 partition with th
> debugfs: ext2 and ext3 file system debugger.
> ext2online: online (mounted) ext2 and ext3 filesystem resizer
>
> +Requirements
> +============
> +
> +Ext3 expects disk/storage subsystem to behave sanely. On sanely
> +behaving disk subsystem, data that have been successfully synced will
> +stay on the disk. Sane means:
> +
> +* write errors not allowed
> +
> +* sector writes are atomic
> +
> +(see expectations.txt; note that most/all linux filesystems have similar
> +expectations)
nfs, cifs, procfs, sysfs, usbfs, tmpfs, ramfs, fuse...
> +* either write caching is disabled, or hw can do barriers and they are enabled.
> +
> + (Note that barriers are disabled by default, use "barrier=1"
> + mount option after making sure hw can support them).
So how does one make sure hw can support them?
Rob
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] Re: document ext3 requirements
2009-01-05 2:42 ` Rob Landley
@ 2009-01-05 9:54 ` Pavel Machek
0 siblings, 0 replies; 67+ messages in thread
From: Pavel Machek @ 2009-01-05 9:54 UTC (permalink / raw)
To: Rob Landley
Cc: Theodore Tso, kernel list, Andrew Morton, mtk.manpages, rdunlap,
linux-doc
> On Sunday 04 January 2009 17:00:53 Pavel Machek wrote:
> > Document linux filesystem expectations. Ext3 can't handle write errors
> > of any kind, and can't handle non-atomic sector writes. Other
> > filesystems are probably even worse...
>
> These concerns look like they're specifically for block backed filesystems,
> which is one of four different types. I wrote a longish incoherent
> rant to
I updated the docs. It now states "block-backed filesystems" in the
first sentence.
> > +Write errors not allowed (NO-WRITE-ERRORS)
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +Writes to media never fail. Even if disk returns error condition
> > +during write, filesystems can't handle that correctly, because success
> > +on fsync was already returned when data hit the journal.
> > +
> > + Fortunately writes failing are very uncommon on traditional
> > + spinning disks, as they have spare sectors they use when write
> > + fails.
>
> The failures show up in dmesg(), and some filesystems will remount themselves
> read only if the physical media driver manages to propagate an error back
> to the filesystem. (Note that the scsi subsystem has historically
Well, you may get an error in dmesg(), but your data are already gone
at that point (and apps don't read dmesg, anyway :-).
> Even sync() and fsync() won't help you there because if multiple processes do
> that, only the _first_ one will get the physical media error. (The filesystem
> doesn't associate physical media errors with processes; there's too many
> layers in between and it's not necessarily a 1:1 relationship
> anyway.)
sync() does not even have return value.
Yep. I'm trying to get fsync manpage updated.
> > +* either write caching is disabled, or hw can do barriers and they are enabled.
> > +
> > + (Note that barriers are disabled by default, use "barrier=1"
> > + mount option after making sure hw can support them).
>
> So how does one make sure hw can support them?
hdparm -I reports them. If you don't see "Native Command Queueing",
you have a problem. Interestingly, neither x60 notebook nor pretty
recent amd workstation has NCQ... Amd notebook seems to be ok.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply [flat|nested] 67+ messages in thread
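The exchange above notes that sync() returns nothing and that a write error
may only surface at fsync() or close(), long after write() returned. A
minimal sketch of what an application can do about that, assuming a POSIX
system: check every return value on the way to disk. The helper name
write_durably() is hypothetical, not something from the thread or the
kernel tree. As the thread points out, a successful return only means no
error was reported to this caller; it is not proof that the data survived
on hardware with an unflushed write cache.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes to path and report any error that
 * open(), write(), fsync() or close() hands back.
 * Returns 0 on success, -1 on failure with errno set.
 */
int write_durably(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int saved;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;

    /* write() may be short or interrupted, so loop until all is queued. */
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            goto fail;
        }
        p += n;
        len -= (size_t)n;
    }

    /* Errors from delayed writeback are reported here, not by write(). */
    if (fsync(fd) < 0)
        goto fail;

    /* close() can also report an error (the thread mentions NFS). */
    if (close(fd) < 0)
        return -1;
    return 0;

fail:
    saved = errno;      /* keep the original error across close() */
    close(fd);
    errno = saved;
    return -1;
}
```

A caller would treat any -1 as "the data may not be on the media" and
retry or report it, rather than relying on dmesg.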
* Re: document ext3 requirements
2009-01-04 22:06 ` Theodore Tso
2009-01-04 22:25 ` Pavel Machek
2009-01-04 23:00 ` [patch] " Pavel Machek
@ 2009-01-04 23:07 ` Pavel Machek
2009-01-05 1:38 ` Rob Landley
3 siblings, 0 replies; 67+ messages in thread
From: Pavel Machek @ 2009-01-04 23:07 UTC (permalink / raw)
To: Theodore Tso, Rob Landley, kernel list, Andrew Morton, mtk.manpages,
rdunlap, linux-doc
On Sun 2009-01-04 17:06:34, Theodore Tso wrote:
> On Sun, Jan 04, 2009 at 01:49:49PM -0600, Rob Landley wrote:
> >
> > Want to document the granularity issues with flash, while you're at it?
> >
> > An inherent problem with using flash as a normal block device is that the
> > flash erase size is bigger than most filesystem sector sizes. So when you
> > request a write, it may erase and rewrite the next 64k, 128k, or even a couple
> > megabytes on the really _big_ ones.
> >
> > If you lose power in the middle of that, ext3 won't notice that data in the
> > "sectors" _after_ the one your were trying to write to got trashed.
>
> True enough, although the newer SSD's will have this problem addressed
> (although at least initially, they are **far** more costly than the
> el-cheapo 32GB SD cards you can find at the checkout counter at Fry's
> alongside battery-powered shavers and trashy ipod speakers).
Hey, I got one of those el-cheapo 32GB SD cards. I fully expected it
to be slow, but eating my data 3 times per month was unexpected even
for me.
I'm not even sure where the blame is. I certainly blame the Linux
documentation: there should be "DON'T USE CRAPPY SD CARDS" warning in
big bold letters somewhere. I guess mkfs.ext3 should just refuse to
make filesystem on them.
(Of course, the manufacturer should have told me that the card is
crap; I can bet it can not even work with VFAT/Windows).
Plus I'd hope some filesystem materializes that can handle 128KB
"block size"... because the el-cheapo card I have here is actually
pretty sane. It seems to store data I put on it, and should be safe to
use with huge block size...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: document ext3 requirements
2009-01-04 22:06 ` Theodore Tso
` (2 preceding siblings ...)
2009-01-04 23:07 ` Pavel Machek
@ 2009-01-05 1:38 ` Rob Landley
3 siblings, 0 replies; 67+ messages in thread
From: Rob Landley @ 2009-01-05 1:38 UTC (permalink / raw)
To: Theodore Tso
Cc: Pavel Machek, kernel list, Andrew Morton, mtk.manpages, rdunlap,
linux-doc
On Sunday 04 January 2009 16:06:34 Theodore Tso wrote:
> True enough, although the newer SSD's will have this problem addressed
> (although at least initially, they are **far** more costly than the
> el-cheapo 32GB SD cards you can find at the checkout counter at Fry's
> alongside battery-powered shavers and trashy ipod speakers).
I have great faith in the ability of PC hardware to continue to be crap for
the foreseeable future.
> I will stress again, that most of this doesn't belong in
> Documentation/filesystems/ext3.txt, as most of this is *not*
> ext3-specific.
Yes and no. Ext3 is enough of a "default" filesystem for Linux that some
documentation on when _not_ to use it sounds like a good idea.
That said, some kind of a "choosing a filesystem" file would be good,
perhaps under the filesystems directory. (Then the ext3 doc would just need
a brief comment and a pointer to the other file.)
Rob
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: document ext3 requirements
2009-01-04 19:49 ` Rob Landley
2009-01-04 22:06 ` Theodore Tso
@ 2009-01-04 22:55 ` Pavel Machek
2009-01-05 0:16 ` david
` (2 more replies)
1 sibling, 3 replies; 67+ messages in thread
From: Pavel Machek @ 2009-01-04 22:55 UTC (permalink / raw)
To: Rob Landley
Cc: kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap, linux-doc
On Sun 2009-01-04 13:49:49, Rob Landley wrote:
> On Saturday 03 January 2009 06:38:15 Pavel Machek wrote:
> > +Ext3 expects disk/storage subsystem to behave sanely. On sanely
> > +behaving disk subsystem, data that have been successfully synced will
> > +stay on the disk. Sane means:
> > +
> > +* writes to media never fail. Even if disk returns error condition during
> > + write, ext3 can't handle that correctly, because success on fsync was already
> > + returned when data hit the journal.
> > +
> > + (Fortunately writes failing are very uncommon on disks, as they
> > + have spare sectors they use when write fails.)
> > +
> > +* either whole sector is correctly written or nothing is written during
> > + powerfail.
> > +
> > + (Unfortunately, none of the cheap USB/SD flash cards I have seen do behave
> > + like this, and are unsuitable for ext3.
>
> Want to document the granularity issues with flash, while you're at it?
>
> An inherent problem with using flash as a normal block device is that the
> flash erase size is bigger than most filesystem sector sizes. So when you
> request a write, it may erase and rewrite the next 64k, 128k, or even a couple
> megabytes on the really _big_ ones.
>
> If you lose power in the middle of that, ext3 won't notice that data in the
> "sectors" _after_ the one you were trying to write to got trashed.
>
> The flash filesystems take this into account as part of their wear levelling
> stuff (they normally copy the entire chunk into a new chunk, leaving the old
> one in place until it's no longer needed), but they need to query the device
> to get the erase granularity in order to do that, which is why they don't work
> on non-flash block devices.
Is there linux filesystem that can handle that? I know jffs2, but
that's unsuitable for stuff like USB thumb drives, right?
Does this sound like a fair summary?
Sector writes are atomic (ATOMIC-SECTORS)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Either whole sector is correctly written or nothing is written during
powerfail.
 Unfortunately, none of the cheap USB/SD flash cards I have seen do
 behave like this, and are unsuitable for all linux filesystems
 I know.
 An inherent problem with using flash as a normal block
 device is that the flash erase size is bigger than
 most filesystem sector sizes. So when you request a
 write, it may erase and rewrite the next 64k, 128k, or
 even a couple megabytes on the really _big_ ones.
 If you lose power in the middle of that, filesystem
 won't notice that data in the "sectors" _after_ the
 one you were trying to write to got trashed.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: document ext3 requirements
2009-01-04 22:55 ` Pavel Machek
@ 2009-01-05 0:16 ` david
2009-01-05 9:38 ` Pavel Machek
2009-01-05 1:50 ` Rob Landley
2009-01-05 3:20 ` Martin K. Petersen
2 siblings, 1 reply; 67+ messages in thread
From: david @ 2009-01-05 0:16 UTC (permalink / raw)
To: Pavel Machek
Cc: Rob Landley, kernel list, Andrew Morton, tytso, mtk.manpages,
rdunlap, linux-doc
On Sun, 4 Jan 2009, Pavel Machek wrote:
> On Sun 2009-01-04 13:49:49, Rob Landley wrote:
>> On Saturday 03 January 2009 06:38:15 Pavel Machek wrote:
>>> +Ext3 expects disk/storage subsystem to behave sanely. On sanely
>>> +behaving disk subsystem, data that have been successfully synced will
>>> +stay on the disk. Sane means:
>>> +
>>> +* writes to media never fail. Even if disk returns error condition during
>>> + write, ext3 can't handle that correctly, because success on fsync was already
>>> + returned when data hit the journal.
>>> +
>>> + (Fortunately writes failing are very uncommon on disks, as they
>>> + have spare sectors they use when write fails.)
>>> +
>>> +* either whole sector is correctly written or nothing is written during
>>> + powerfail.
>>> +
>>> + (Unfortunately, none of the cheap USB/SD flash cards I have seen do behave
>>> + like this, and are unsuitable for ext3.
>>
>> Want to document the granularity issues with flash, while you're at it?
>>
>> An inherent problem with using flash as a normal block device is that the
>> flash erase size is bigger than most filesystem sector sizes. So when you
>> request a write, it may erase and rewrite the next 64k, 128k, or even a couple
>> megabytes on the really _big_ ones.
>>
>> If you lose power in the middle of that, ext3 won't notice that data in the
>> "sectors" _after_ the one you were trying to write to got trashed.
>>
>> The flash filesystems take this into account as part of their wear levelling
>> stuff (they normally copy the entire chunk into a new chunk, leaving the old
>> one in place until it's no longer needed), but they need to query the device
>> to get the erase granularity in order to do that, which is why they don't work
>> on non-flash block devices.
>
> Is there linux filesystem that can handle that? I know jffs2, but
> that's unsuitable for stuff like USB thumb drives, right?
>
> Does this sound like a fair summary?
>
> Sector writes are atomic (ATOMIC-SECTORS)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Either whole sector is correctly written or nothing is written during
> powerfail.
>
> Unfortunately, none of the cheap USB/SD flash cards I have seen do
> behave like this, and are unsuitable for all linux filesystems
> I know.
>
> An inherent problem with using flash as a normal block
> device is that the flash erase size is bigger than
> most filesystem sector sizes. So when you request a
> write, it may erase and rewrite the next 64k, 128k, or
> even a couple megabytes on the really _big_ ones.
>
> If you lose power in the middle of that, filesystem
> won't notice that data in the "sectors" _after_ the
> one you were trying to write to got trashed.
around, not after. the block you are reading could be in the middle or at
the end of an eraseblock.
David Lang
^ permalink raw reply [flat|nested] 67+ messages in thread
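The "around, not after" correction can be made concrete with a little
arithmetic. The following is a minimal sketch under loud assumptions: erase
blocks are taken to be aligned at offset 0 and logical sectors are assumed
to map linearly onto them, which real cards with a remapping controller may
not do. The names sector_range and eraseblock_neighbours() are made up for
illustration.

```c
#include <stdint.h>

/* Inclusive range of logical sector numbers. */
struct sector_range {
    uint64_t first;
    uint64_t last;
};

/*
 * Hypothetical helper: return every sector that shares an erase block
 * with `sector`, assuming erase blocks start at sector 0 and are laid
 * out back to back (an assumption, not something a USB/SD card tells you).
 */
struct sector_range eraseblock_neighbours(uint64_t sector,
                                          uint32_t sector_size,
                                          uint32_t eraseblock_size)
{
    uint64_t per_block = eraseblock_size / sector_size;
    struct sector_range r;

    if (per_block == 0)          /* erase block smaller than a sector */
        per_block = 1;

    /* Round down to the start of the enclosing erase block. */
    r.first = (sector / per_block) * per_block;
    r.last = r.first + per_block - 1;
    return r;
}
```

With 512-byte sectors and a 128k erase block, a write to sector 1000 puts
sectors 768 through 1023 at risk: 232 sectors before the one being written
and 23 after it.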
* Re: document ext3 requirements
2009-01-05 0:16 ` david
@ 2009-01-05 9:38 ` Pavel Machek
0 siblings, 0 replies; 67+ messages in thread
From: Pavel Machek @ 2009-01-05 9:38 UTC (permalink / raw)
To: david
Cc: Rob Landley, kernel list, Andrew Morton, tytso, mtk.manpages,
rdunlap, linux-doc
> >Sector writes are atomic (ATOMIC-SECTORS)
> >~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> >Either whole sector is correctly written or nothing is written during
> >powerfail.
> >
> > Unfortunately, none of the cheap USB/SD flash cards I have seen do
> > behave like this, and are unsuitable for all linux filesystems
> > I know.
> >
> > An inherent problem with using flash as a normal block
> > device is that the flash erase size is bigger than
> > most filesystem sector sizes. So when you request a
> > write, it may erase and rewrite the next 64k, 128k, or
> > even a couple megabytes on the really _big_ ones.
> >
> > If you lose power in the middle of that, filesystem
> > won't notice that data in the "sectors" _after_ the
> > one you were trying to write to got trashed.
>
> around, not after. the block you are reading could be in the middle or at
> the end of an eraseblock.
Applied, thanks.
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: document ext3 requirements
2009-01-04 22:55 ` Pavel Machek
2009-01-05 0:16 ` david
@ 2009-01-05 1:50 ` Rob Landley
2009-01-05 3:20 ` Martin K. Petersen
2 siblings, 0 replies; 67+ messages in thread
From: Rob Landley @ 2009-01-05 1:50 UTC (permalink / raw)
To: Pavel Machek
Cc: kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap, linux-doc
On Sunday 04 January 2009 16:55:45 Pavel Machek wrote:
> On Sun 2009-01-04 13:49:49, Rob Landley wrote:
> > On Saturday 03 January 2009 06:38:15 Pavel Machek wrote:
> > > +Ext3 expects disk/storage subsystem to behave sanely. On sanely
> > > +behaving disk subsystem, data that have been successfully synced will
> > > +stay on the disk. Sane means:
> > > +
> > > +* writes to media never fail. Even if disk returns error condition during
> > > + write, ext3 can't handle that correctly, because success on fsync was already
> > > + returned when data hit the journal.
> > > +
> > > + (Fortunately writes failing are very uncommon on disks, as they
> > > + have spare sectors they use when write fails.)
> > > +
> > > +* either whole sector is correctly written or nothing is written during
> > > + powerfail.
> > > +
> > > + (Unfortunately, none of the cheap USB/SD flash cards I have seen do behave
> > > + like this, and are unsuitable for ext3.
> >
> > Want to document the granularity issues with flash, while you're at it?
> >
> > An inherent problem with using flash as a normal block device is that the
> > flash erase size is bigger than most filesystem sector sizes. So when
> > you request a write, it may erase and rewrite the next 64k, 128k, or even
> > a couple megabytes on the really _big_ ones.
> >
> > If you lose power in the middle of that, ext3 won't notice that data in
> > the "sectors" _after_ the one you were trying to write to got trashed.
> >
> > The flash filesystems take this into account as part of their wear
> > levelling stuff (they normally copy the entire chunk into a new chunk,
> > leaving the old one in place until it's no longer needed), but they need
> > to query the device to get the erase granularity in order to do that,
> > which is why they don't work on non-flash block devices.
>
> Is there linux filesystem that can handle that? I know jffs2, but
> that's unsuitable for stuff like USB thumb drives, right?
Any of the flash filesystems should handle that. The main problem with
jffs2 is it doesn't scale well to large device sizes. UBIFS is supposed to
scale much better, but I haven't played with it yet.
And the thing about USB thumb drives is they present as a normal block
device, _not_ as flash, so you can't _query_ their erase granularity.
(It's like those hardware raid cards that wouldn't tell you they were
striping and such so you had to figure out a well-performing layout all by
yourself.) They do it magically behind the scenes, and if the power goes
out (or you yank the device out unexpectedly) and they haven't got a
built-in capacitor or battery to have enough power to complete their
pending transaction, you're screwed.
Plus they do horrible wear levelling, the lot of 'em. Read Val Henson's
livejournal entry about it:
http://valhenson.livejournal.com/25228.html
There was also a marvelous thread Linus participated in on some hardware
industry web message board, but I have no idea where it's gone...
> Does this sound like a fair summary?
See Ted's comment. The summary's fine, the question is where to put this
sort of thing...
> If you lose power in the middle of that, filesystem
> won't notice that data in the "sectors" _after_ the
> one you were trying to write to got trashed.
Well, the journal won't notice. An e2fsck will notice huge swaths of
missing metadata, but won't be able to do anything about it. (And if what
got zapped was file _contents_ rather than metadata, you're on your own
finding it. Fun, isn't it?)
> Pavel
Rob
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: document ext3 requirements
2009-01-04 22:55 ` Pavel Machek
2009-01-05 0:16 ` david
2009-01-05 1:50 ` Rob Landley
@ 2009-01-05 3:20 ` Martin K. Petersen
2009-01-05 9:45 ` Pavel Machek
2 siblings, 1 reply; 67+ messages in thread
From: Martin K. Petersen @ 2009-01-05 3:20 UTC (permalink / raw)
To: Pavel Machek
Cc: Rob Landley, kernel list, Andrew Morton, tytso, mtk.manpages,
rdunlap, linux-doc
>>>>> "Pavel" == Pavel Machek <pavel@suse.cz> writes:
Pavel> Does this sound like a fair summary?
Pavel> Sector writes are atomic (ATOMIC-SECTORS)
Pavel> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I'd just like to point out that the all-or-nothing hardware sector
atomicity thing is -- to a large extent -- a myth.
It is mostly true on SCSI class devices because various UNIX, RAID array
and database vendors have spent many years leaning very hard on the
drive manufacturers to make it so.
But it's not a hard guarantee, you can't get it in writing, and it's not
in any of the standards. Hybrid drives with flash had potential to
close that particular loophole but those appear to be dead in the water.
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: document ext3 requirements
2009-01-05 3:20 ` Martin K. Petersen
@ 2009-01-05 9:45 ` Pavel Machek
2009-01-05 11:28 ` Alan Cox
2009-01-05 19:15 ` Martin K. Petersen
0 siblings, 2 replies; 67+ messages in thread
From: Pavel Machek @ 2009-01-05 9:45 UTC (permalink / raw)
To: Martin K. Petersen
Cc: Rob Landley, kernel list, Andrew Morton, tytso, mtk.manpages,
rdunlap, linux-doc
> >>>>> "Pavel" == Pavel Machek <pavel@suse.cz> writes:
>
> Pavel> Does this sound like a fair summary?
>
> Pavel> Sector writes are atomic (ATOMIC-SECTORS)
> Pavel> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> I'd just like to point out that the all-or-nothing hardware sector
> atomicity thing is -- to a large extent -- a myth.
It is a myth that linux filesystems depend on for safe operation :-(.
> It is mostly true on SCSI class devices because various UNIX, RAID array
> and database vendors have spent many years leaning very hard on the
> drive manufacturers to make it so.
>
> But it's not a hard guarantee, you can't get it in writing, and it's not
> in any of the standards. Hybrid drives with flash had potential to
> close that particular loophole but those appear to be dead in the water.
So "in practice it works but vendors will not guarantee that"?
How much true is it for normal SATA drives? Are there some tests I can
just run on a machine, powercycle it few times, and it tells me if my
disk is non-ATOMIC-SECTORS?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: document ext3 requirements
2009-01-05 9:45 ` Pavel Machek
@ 2009-01-05 11:28 ` Alan Cox
2009-01-05 19:15 ` Martin K. Petersen
1 sibling, 0 replies; 67+ messages in thread
From: Alan Cox @ 2009-01-05 11:28 UTC (permalink / raw)
To: Pavel Machek
Cc: Martin K. Petersen, Rob Landley, kernel list, Andrew Morton, tytso,
mtk.manpages, rdunlap, linux-doc
> How much true is it for normal SATA drives? Are there some tests I can
> just run on a machine, powercycle it few times, and it tells me if my
> disk is non-ATOMIC-SECTORS?
No. And even if it did, writes to one sector can damage another.
The mathematical certainty stuff lives only in the world of maths. In the
real world everything is probabilities.
Alan
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: document ext3 requirements
2009-01-05 9:45 ` Pavel Machek
2009-01-05 11:28 ` Alan Cox
@ 2009-01-05 19:15 ` Martin K. Petersen
2009-01-05 20:19 ` Theodore Tso
1 sibling, 1 reply; 67+ messages in thread
From: Martin K. Petersen @ 2009-01-05 19:15 UTC (permalink / raw)
To: Pavel Machek
Cc: Martin K. Petersen, Rob Landley, kernel list, Andrew Morton,
tytso, mtk.manpages, rdunlap, linux-doc

>>>>> "Pavel" == Pavel Machek <pavel@suse.cz> writes:

>> It is mostly true on SCSI class devices because various UNIX, RAID
>> array and database vendors have spent many years leaning very hard on
>> the drive manufacturers to make it so.
>>
>> But it's not a hard guarantee, you can't get it in writing, and it's
>> not in any of the standards. Hybrid drives with flash had potential
>> to close that particular loophole but those appear to be dead in the
>> water.

Pavel> So "in practice it works but vendors will not guarantee that"?

It works some of the time. But in reality if you yank power halfway
during a write operation the end result is undefined.

The saving grace for normal users is that the potential corruption is
limited to a couple of sectors.

The current suck of flash SSDs is that the erase block size amplifies
this problem by at least one order of magnitude, often two. I have a
couple of SSDs here that will leave my filesystem in shambles every time
the machine crashes. I quickly got tired of reinstalling Fedora several
times per week so now my main machine is back to spinning media.

The people that truly and deeply care about this type of write atomicity
(i.e. enterprises) deploy disk arrays that will do the right thing in
face of an error. This involves NVRAM, mirrored caches, uninterruptible
power supplies, etc. Brute force if you will.

High-end arrays even give you atomicity at a bigger granularity such as
filesystem or database blocks. On some storage you can say "this LUN is
used for an Oracle database that always writes in multiples of 8KB" and
the array will guarantee that each 8KB block of the I/O is written in
its entirety or not at all. Some arrays even allow you to verify Oracle
logical block checksums to ensure that the I/O is intact and internally
consistent.

I have been bugging storage vendors about a per-I/O write atomicity
setting for a while. But it really messes up their pipelining so they
aren't keen on the idea. We may be able to get some of it fixed as a
side-effect of the DIF bits vs. the impending switch to 4KB sectors,
though.

--
Martin K. Petersen Oracle Linux Engineering

^ permalink raw reply [flat|nested] 67+ messages in thread
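
The "written in its entirety or not at all" guarantee and the block
checksum verification that Martin describes can be illustrated with a
small sketch. This is not Oracle's block format and not any array's
firmware logic: the 4-byte CRC32 trailer, the 512-byte sector size and
the helper names (seal_block, block_is_intact, torn_write) are all
assumptions made purely for illustration. It only shows how a per-block
checksum lets a reader tell a torn 8KB write apart from a block that is
either fully old or fully new.

```python
import struct
import zlib

BLOCK_SIZE = 8 * 1024           # 8KB database block, as in the message
SECTOR_SIZE = 512               # assumed device sector size
TRAILER = struct.Struct("<I")   # hypothetical layout: CRC32 in the last 4 bytes
PAYLOAD_SIZE = BLOCK_SIZE - TRAILER.size


def seal_block(payload: bytes) -> bytes:
    """Append a CRC32 of the payload so the block can be verified later."""
    if len(payload) != PAYLOAD_SIZE:
        raise ValueError("payload must fill the block exactly")
    return payload + TRAILER.pack(zlib.crc32(payload))


def block_is_intact(block: bytes) -> bool:
    """Return True if the stored CRC32 matches the payload."""
    if len(block) != BLOCK_SIZE:
        return False
    payload = block[:PAYLOAD_SIZE]
    (stored_crc,) = TRAILER.unpack(block[PAYLOAD_SIZE:])
    return zlib.crc32(payload) == stored_crc


def torn_write(old: bytes, new: bytes, sectors_written: int) -> bytes:
    """Simulate power loss after only the first N sectors reached the media."""
    cut = sectors_written * SECTOR_SIZE
    return new[:cut] + old[cut:]
```

With two sealed blocks that differ in their first bytes, torn_write(old,
new, 1) produces a block whose payload no longer matches its trailer, so
block_is_intact() rejects it, while sectors_written values of 0 and 16
yield the old and new block respectively and both verify. A checksum can
only detect the torn block after the fact; preventing it is what the
NVRAM and mirrored caches mentioned above are for.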
* Re: document ext3 requirements
2009-01-05 19:15 ` Martin K. Petersen
@ 2009-01-05 20:19 ` Theodore Tso
0 siblings, 0 replies; 67+ messages in thread
From: Theodore Tso @ 2009-01-05 20:19 UTC (permalink / raw)
To: Martin K. Petersen
Cc: Pavel Machek, Rob Landley, kernel list, Andrew Morton,
mtk.manpages, rdunlap, linux-doc

On Mon, Jan 05, 2009 at 02:15:44PM -0500, Martin K. Petersen wrote:
>
> It works some of the time. But in reality if you yank power halfway
> during a write operation the end result is undefined.
>
> The saving grace for normal users is that the potential corruption is
> limited to a couple of sectors.

A few years ago it was asserted to me that the internal block size for
spinning magnetic media was around 32k. So if the hard drive doesn't
have enough of a capacitor or other energy reserve to complete its
internal read-modify-write cycle, attempts to read the 32k chunk of
disk could result in hard ECC failures that would cause the blocks in
question to all return uncorrectable read errors when they are
accessed.

Of course, if the memory goes south first, and you're in the middle of
streaming a 128k update to the inode the filesystem, and the power
fails, and the memory starts returning garbage during the DMA
operation, you may have much bigger problems. :-)

So it's probably more than "a couple of sectors"....

> The current suck of flash SSDs is that the erase block size amplifies
> this problem by at least one order of magnitude, often two. I have a
> couple of SSDs here that will leave my filesystem in shambles every time
> the machine crashes. I quickly got tired of reinstalling Fedora several
> times per week so now my main machine is back to spinning media.

The erase block size is typically 1 to 4 megabytes, from my
understanding. So yeah, that's easily 1-2 orders of magnitude. Worse
yet, flash's sequential streaming write speeds are much slower than a
hard drive's (anywhere from a factor of 3 to 12 depending on how
cheap/trashy the flash drive happens to be), so that opens the time
window even further, by possibly as much as another order of
magnitude.

I also suspect that HDD manufacturers have learned various tricks (due
to enterprise storage/database vendors leaning on them) to make the
drives appear more atomic in the face of hard drive errors. Also, in
Pavel's case, as I recall, he was using the card in a laptop where the
SD card protruded slightly from the laptop case, and it was very easy
for it to get dislodged, meaning that power failures during writes were
even more likely than you would expect with a fixed HDD or SSD which is
secured into place using screws or other more reliable mounting
hardware.

Put all of this together, given that Pavel's Really Trashy 32GB SD was
probably the full 3 orders of magnitude worse than traditional HDD, and
he was having many more failures due to physical mounting issues, it's
not surprising that most people haven't seen problems with traditional
HDDs, even though none of this is guaranteed by the hard drive vendors.

> The people that truly and deeply care about this type of write atomicity
> (i.e. enterprises) deploy disk arrays that will do the right thing in
> face of an error. This involves NVRAM, mirrored caches, uninterruptible
> power supplies, etc. Brute force if you will.

Don't forget non-cheesy mounting options so an accidental brush against
the side of the unit doesn't cause the hard drive to become
disconnected from the system and suffer a power drop. I guess that gets
filed under "Brute force" as well. :-)

- Ted

P.S. I feel obliged to point out that in my Lenovo X61s, the SD card
is flush with the laptop case when inserted, and I've never had a
problem with the SD card being prematurely ejected during
operation. :-)

^ permalink raw reply [flat|nested] 67+ messages in thread
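
Ted's order-of-magnitude estimate can be reproduced as back-of-envelope
arithmetic. The sketch below assumes that exposure scales as (bytes at
risk per interrupted write) times (how much longer the write takes), and
it takes the 32k HDD internal block as the baseline. Both are
simplifying assumptions, and relative_exposure is a hypothetical helper
name. The numeric ranges are the ones quoted in the message.

```python
import math

KIB = 1024
MIB = 1024 * KIB

HDD_INTERNAL_BLOCK = 32 * KIB          # "around 32k", as asserted to Ted
FLASH_ERASE_BLOCK = (1 * MIB, 4 * MIB) # "typically 1 to 4 megabytes"
FLASH_WRITE_SLOWDOWN = (3, 12)         # "a factor of 3 to 12" slower writes


def relative_exposure(erase_block, slowdown, baseline=HDD_INTERNAL_BLOCK):
    """Assumed model: (data at risk vs. HDD) * (time window vs. HDD)."""
    return (erase_block / baseline) * slowdown


# Best case: small erase block on fast flash. Worst case: the opposite.
low = relative_exposure(FLASH_ERASE_BLOCK[0], FLASH_WRITE_SLOWDOWN[0])
high = relative_exposure(FLASH_ERASE_BLOCK[1], FLASH_WRITE_SLOWDOWN[1])

if __name__ == "__main__":
    print(f"best case:  {low:.0f}x  (~{math.log10(low):.1f} orders of magnitude)")
    print(f"worst case: {high:.0f}x (~{math.log10(high):.1f} orders of magnitude)")
```

Under these assumptions the range works out to roughly 96x to 1536x,
i.e. about two to three orders of magnitude, which is consistent with
the "full 3 orders of magnitude" figure for a worst-case device. The
model ignores the mounting and power-loss frequency effects discussed in
the message.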