document ext3 requirements

public inbox for linux-kernel@vger.kernel.org

* document ext3 requirements
@ 2009-01-03 12:38 Pavel Machek
  2009-01-03 21:17 ` Martin MOKREJŠ
                   ` (3 more replies)
  0 siblings, 4 replies; 67+ messages in thread
From: Pavel Machek @ 2009-01-03 12:38 UTC (permalink / raw)
  To: kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap,
	linux-doc

Using ext3 is only safe if the storage subsystem meets certain
criteria. Document those.

errors=remount-ro is documented as the default, but the superblock setting
overrides that, and mkfs defaults to errors=continue, so in practice the
default is errors=continue.

A read-only mount does actually write to the media in some cases. Document that.

Signed-off-by: Pavel Machek <pavel@suse.cz>

diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index 9dd2a3b..74a73b0 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -14,6 +14,9 @@ Options
 When mounting an ext3 filesystem, the following option are accepted:
 (*) == default
 
+ro			Note that ext3 will replay the journal (and thus write
+			to the partition) even when mounted "read only".
+
 journal=update		Update the ext3 file system's journal to the current
 			format.
 
@@ -95,6 +98,8 @@ debug			Extra debugging information is sent to syslog.
 errors=remount-ro(*)	Remount the filesystem read-only on an error.
 errors=continue		Keep going on a filesystem error.
 errors=panic		Panic and halt the machine if an error occurs.
+			(Note that the default is overridden by the superblock
+			setting on most systems).
 
 data_err=ignore(*)	Just print an error message if an error occurs
 			in a file data buffer in ordered mode.
@@ -188,6 +193,34 @@ mke2fs: 	create a ext3 partition with the -j flag.
 debugfs: 	ext2 and ext3 file system debugger.
 ext2online:	online (mounted) ext2 and ext3 filesystem resizer
 
+Requirements
+============
+
+Ext3 expects the disk/storage subsystem to behave sanely. On a sanely
+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:
+
+* writes to media never fail. Even if the disk returns an error condition
+  during a write, ext3 can't handle that correctly, because success on fsync
+  was already returned when the data hit the journal.
+
+	   (Fortunately, failing writes are very uncommon on disks, as they
+	   have spare sectors that they use when a write fails.)
+
+* either the whole sector is correctly written or nothing is written during
+  a power failure.
+
+	   (Unfortunately, none of the cheap USB/SD flash cards I have seen
+	   behave like this, so they are unsuitable for ext3. Because RAM tends
+	   to fail faster than the rest of the system during a power failure,
+	   special hardware that kills DMA transfers may be necessary. I am not
+	   sure how common that problem is on generic PC machines.)
+
+* either write caching is disabled, or hw can do barriers and they are enabled.
+
+	   (Note that barriers are disabled by default; use the "barrier=1"
+	   mount option after making sure the hardware supports them.)
+
 
 References
 ==========

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
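
To illustrate the errors= point in the patch: the behaviour that actually applies is recorded in the superblock, so it can be read straight from the device or from an image of it. This is a minimal sketch, not part of the patch. It assumes the standard ext2/ext3 on-disk layout (superblock at byte 1024, s_magic at offset 56, s_errors at offset 60, both little-endian 16-bit); the function names are made up for illustration.

```python
import struct

SUPERBLOCK_OFFSET = 1024      # the primary superblock starts 1024 bytes in
EXT_MAGIC = 0xEF53            # s_magic, little-endian u16 at offset 56
ERRORS_NAMES = {0: "unset", 1: "continue", 2: "remount-ro", 3: "panic"}


def superblock_errors_policy(image: bytes) -> str:
    """Return the error behaviour recorded in an ext2/ext3 superblock.

    `image` must hold at least the first 2048 bytes of the filesystem.
    """
    sb = image[SUPERBLOCK_OFFSET:SUPERBLOCK_OFFSET + 1024]
    if len(sb) < 62:
        raise ValueError("image too short to contain a superblock")
    (magic,) = struct.unpack_from("<H", sb, 56)
    if magic != EXT_MAGIC:
        raise ValueError("not an ext2/ext3 superblock")
    (errors,) = struct.unpack_from("<H", sb, 60)   # s_errors
    return ERRORS_NAMES.get(errors, "unknown (%d)" % errors)


def read_errors_policy(path: str) -> str:
    """Convenience wrapper: read the policy from a device node or image file."""
    with open(path, "rb") as f:
        return superblock_errors_policy(f.read(2048))
```

Calling read_errors_policy() on a device node normally needs read permission on that node; it only reads and never writes.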


* Re: document ext3 requirements
  2009-01-03 12:38 document ext3 requirements Pavel Machek
@ 2009-01-03 21:17 ` Martin MOKREJŠ
  2009-01-03 22:06   ` Pavel Machek
  2009-01-03 22:17   ` Duane Griffin
  2009-01-04  2:32 ` Theodore Tso
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 67+ messages in thread
From: Martin MOKREJŠ @ 2009-01-03 21:17 UTC (permalink / raw)
  To: Pavel Machek
  Cc: kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap,
	linux-doc

Can one then avoid replaying the journal if the filesystem is unclean?
Just curious.
M.

Pavel Machek wrote:
> Using ext3 is only safe if storage subsystem meets certain
> criteria. Document those.
> 
> Errors=remount-ro is documented as default, but superblock setting
> overrides that and mkfs defaults to errors=continue... so the default
> is errors=continue in practice.
> 
> readonly mount does actually write to the media in some cases. Document that.
> 
> Signed-off-by: Pavel Machek <pavel@suse.cz>
> 
> diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt


* Re: document ext3 requirements
  2009-01-03 21:17 ` Martin MOKREJŠ
@ 2009-01-03 22:06   ` Pavel Machek
  2009-01-03 22:17   ` Duane Griffin
  1 sibling, 0 replies; 67+ messages in thread
From: Pavel Machek @ 2009-01-03 22:06 UTC (permalink / raw)
  To: Martin MOKREJŠ
  Cc: kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap,
	linux-doc

On Sat 2009-01-03 22:17:11, Martin MOKREJŠ wrote:
> Can one avoid replay of the journal then if it would be unclean?
> Just curious.

Well, mounting an unclean filesystem is dangerous, but depending on
circumstances it may be better than writing to the filesystem.

(You may not be able to read some data and may provoke kernel bugs,
but at least you don't damage what is on disk. If you are collecting
evidence, not writing is very important. If you suspect something is
very wrong with the drive, not writing is a good idea.)

								Pavel
> 
> Pavel Machek wrote:
> > Using ext3 is only safe if storage subsystem meets certain
> > criteria. Document those.
> > 
> > Errors=remount-ro is documented as default, but superblock setting
> > overrides that and mkfs defaults to errors=continue... so the default
> > is errors=continue in practice.
> > 
> > readonly mount does actually write to the media in some cases. Document that.
> > 
> > Signed-off-by: Pavel Machek <pavel@suse.cz>
> > 
> > diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: document ext3 requirements
  2009-01-03 21:17 ` Martin MOKREJŠ
  2009-01-03 22:06   ` Pavel Machek
@ 2009-01-03 22:17   ` Duane Griffin
  2009-01-03 22:29     ` Pavel Machek
  1 sibling, 1 reply; 67+ messages in thread
From: Duane Griffin @ 2009-01-03 22:17 UTC (permalink / raw)
  To: Martin MOKREJŠ
  Cc: Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc

[Fixed top-posting]

2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
> Pavel Machek wrote:
>> readonly mount does actually write to the media in some cases. Document that.
>>
> Can one avoid replay of the journal then if it would be unclean?
> Just curious.

Nope. If the underlying block device is read-only then mounting the
filesystem will fail. I tried to fix this some time ago, and have a
set of patches that almost always work, but "almost always" isn't good
enough. Unfortunately I never managed to figure out a way to finish it
off without disgusting hacks or major surgery.

> M.

Cheers,
Duane.

-- 
"I never could learn to drink that blood and call it wine" - Bob Dylan


* Re: document ext3 requirements
  2009-01-03 22:17   ` Duane Griffin
@ 2009-01-03 22:29     ` Pavel Machek
  2009-01-03 23:01       ` Martin MOKREJŠ
                         ` (2 more replies)
  0 siblings, 3 replies; 67+ messages in thread
From: Pavel Machek @ 2009-01-03 22:29 UTC (permalink / raw)
  To: Duane Griffin
  Cc: Martin MOKREJŠ, kernel list, Andrew Morton, tytso,
	mtk.manpages, rdunlap, linux-doc

On Sat 2009-01-03 22:17:15, Duane Griffin wrote:
> [Fixed top-posting]
> 
> 2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
> > Pavel Machek wrote:
> >> readonly mount does actually write to the media in some cases. Document that.
> >>
> > Can one avoid replay of the journal then if it would be unclean?
> > Just curious.
> 
> Nope. If the underlying block device is read-only then mounting the
> filesystem will fail. I tried to fix this some time ago, and have a
> set of patches that almost always work, but "almost always" isn't good
> enough. Unfortunately I never managed to figure out a way to finish it
> off without disgusting hacks or major surgery.

Uhuh, can you just ignore the journal and mount it anyway?
...basically treating it like an ext2?

...ok, that will present an "old" version of the filesystem to the
user... violating fsync() semantics.

Still handy for recovering badly broken filesystems, I'd say.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: document ext3 requirements
  2009-01-03 22:29     ` Pavel Machek
@ 2009-01-03 23:01       ` Martin MOKREJŠ
  2009-01-03 23:38         ` Duane Griffin
                           ` (3 more replies)
  2009-01-03 23:12       ` Duane Griffin
  2009-01-06 10:06       ` Matthias Andree
  2 siblings, 4 replies; 67+ messages in thread
From: Martin MOKREJŠ @ 2009-01-03 23:01 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Duane Griffin, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc

Pavel Machek wrote:
> On Sat 2009-01-03 22:17:15, Duane Griffin wrote:
>> [Fixed top-posting]
>>
>> 2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
>>> Pavel Machek wrote:
>>>> readonly mount does actually write to the media in some cases. Document that.
>>>>
>>> Can one avoid replay of the journal then if it would be unclean?
>>> Just curious.
>> Nope. If the underlying block device is read-only then mounting the
>> filesystem will fail. I tried to fix this some time ago, and have a
>> set of patches that almost always work, but "almost always" isn't good
>> enough. Unfortunately I never managed to figure out a way to finish it
>> off without disgusting hacks or major surgery.
> 
> Uhuh, can you just ignore the journal and mount it anyway?
> ...basically treating it like an ext2?
> 
> ...ok, that will present "old" version of the filesystem to the
> user... violating fsync() semantics.

Hmm, so if my dual-boot machine does not shut down correctly and I
accidentally boot into M$ Win, where I use the ext2 IFS driver, and modify
some stuff on the ext3 drive, then after a while reboot into Linux and the
journal gets replayed ... Mmm ...

> 
> Still handy for recovering badly broken filesystems, I'd say.

Me as well. How about improving your doc patch with a summary of
this thread (although it is probably not over yet)? ;-) A note that
one can mount it as ext2 while read-only would definitely be helpful
when doing forensics on the disk.



* Re: document ext3 requirements
  2009-01-03 23:01       ` Martin MOKREJŠ
@ 2009-01-03 23:38         ` Duane Griffin
  2009-01-03 23:50           ` Martin MOKREJŠ
  2009-01-04  0:19         ` Pavel Machek
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 67+ messages in thread
From: Duane Griffin @ 2009-01-03 23:38 UTC (permalink / raw)
  To: Martin MOKREJŠ
  Cc: Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc

2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
> Hmm, so if my dual-boot machine does not shutdown correctly and I boot
> accidentally in M$ Win where I use ext2 IFS driver and modify some
> stuff on the ext3 drive, after a while reboot to linux and the journal
> get re-played ... Mmm ...

You *really* wouldn't want to be doing that.

The other scenario that people have reported trouble with is
suspending the system, booting a live CD which "read-only" mounts the
filesystem (and replays the journal), then resuming.

Cheers,
Duane.

-- 
"I never could learn to drink that blood and call it wine" - Bob Dylan


* Re: document ext3 requirements
  2009-01-03 23:38         ` Duane Griffin
@ 2009-01-03 23:50           ` Martin MOKREJŠ
  2009-01-03 23:58             ` Robert Hancock
  2009-01-04  0:00             ` Duane Griffin
  0 siblings, 2 replies; 67+ messages in thread
From: Martin MOKREJŠ @ 2009-01-03 23:50 UTC (permalink / raw)
  To: Duane Griffin
  Cc: Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc

Duane Griffin wrote:
> 2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
>> Hmm, so if my dual-boot machine does not shutdown correctly and I boot
>> accidentally in M$ Win where I use ext2 IFS driver and modify some
>> stuff on the ext3 drive, after a while reboot to linux and the journal
>> get re-played ... Mmm ...
> 
> You *really* wouldn't want to be doing that.
> 
> The other scenario that people have reported trouble with is
> suspending the system, booting a live CD which "read-only" mounts the
> filesystem (and replays the journal), then resuming.

Why doesn't "mount -o ro" fail when it would have to replay the journal,
with a message that the user must run fsck.ext3 before the filesystem can be
mounted, even read-only? Still, I would prefer an extra switch that
forces a read-only mount without touching the journal, for disk forensics.
I think that would also keep a LiveCD/rescue distribution from
mounting and replaying it automagically; the user would really have to
provide the switch to the command. I really do not want a recovery
boot CD to touch my partitions without my consent.

Sure, that does not prevent my case, where I let the ext2 IFS driver write
to my ext3 partition. Actually, couldn't the driver at least warn me that
the journal is non-empty? (I am just a user, sorry, and cannot check
the code at www.fs-driver.org myself to see whether it could do at least
this, although it does not understand ext3.) ;-)

Martin


* Re: document ext3 requirements
  2009-01-03 23:50           ` Martin MOKREJŠ
@ 2009-01-03 23:58             ` Robert Hancock
  2009-01-04  0:08               ` Martin MOKREJŠ
  2009-01-04 21:49               ` Ingo Oeser
  2009-01-04  0:00             ` Duane Griffin
  1 sibling, 2 replies; 67+ messages in thread
From: Robert Hancock @ 2009-01-03 23:58 UTC (permalink / raw)
  To: Martin MOKREJŠ
  Cc: Duane Griffin, Pavel Machek, kernel list, Andrew Morton, tytso,
	mtk.manpages, rdunlap, linux-doc

Martin MOKREJŠ wrote:
> Duane Griffin wrote:
>> 2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
>>> Hmm, so if my dual-boot machine does not shutdown correctly and I boot
>>> accidentally in M$ Win where I use ext2 IFS driver and modify some
>>> stuff on the ext3 drive, after a while reboot to linux and the journal
>>> get re-played ... Mmm ...
>> You *really* wouldn't want to be doing that.
>>
>> The other scenario that people have reported trouble with is
>> suspending the system, booting a live CD which "read-only" mounts the
>> filesystem (and replays the journal), then resuming.
> 
> Why does not "mount -ro" die when it would have to replay the journal
> with a message that user must run fsck.ext3 in order to be able to mount
> it albeit read-only? Still I would prefer having an extra switch to

That would break typical system bootup in the unclean journal case, 
normally the root FS is mounted read-only to start with (which replays 
the journal) and remounted read-write later on - and usually the fsck 
utilities are located on the root filesystem..

> force mount RO while not touching the journal for disk forensics.
> I think that would also prevent the cases when a LiveCD/rescue distribution
> would not mount+replay it automagically but user would really have to
> provide the switch to the command. I am really not using the recovery
> boot cd to touch my partitions in some cases unwillingly.

I agree, there should be a way to force it to mount "really read only" 
so it doesn't try to replay the journal. That might require just 
ignoring the journal content, which may result in the FS appearing 
corrupt, but for recovery/forensics purposes that seems better than 
nothing..


* Re: document ext3 requirements
  2009-01-03 23:58             ` Robert Hancock
@ 2009-01-04  0:08               ` Martin MOKREJŠ
  2009-01-04 21:49               ` Ingo Oeser
  1 sibling, 0 replies; 67+ messages in thread
From: Martin MOKREJŠ @ 2009-01-04  0:08 UTC (permalink / raw)
  To: Robert Hancock
  Cc: Duane Griffin, Pavel Machek, kernel list, Andrew Morton, tytso,
	mtk.manpages, rdunlap, linux-doc

Robert Hancock wrote:
> Martin MOKREJŠ wrote:
>> Duane Griffin wrote:
>>> 2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
>>>> Hmm, so if my dual-boot machine does not shutdown correctly and I boot
>>>> accidentally in M$ Win where I use ext2 IFS driver and modify some
>>>> stuff on the ext3 drive, after a while reboot to linux and the journal
>>>> get re-played ... Mmm ...
>>> You *really* wouldn't want to be doing that.
>>>
>>> The other scenario that people have reported trouble with is
>>> suspending the system, booting a live CD which "read-only" mounts the
>>> filesystem (and replays the journal), then resuming.
>>
>> Why does not "mount -ro" die when it would have to replay the journal
>> with a message that user must run fsck.ext3 in order to be able to mount
>> it albeit read-only? Still I would prefer having an extra switch to
> 
> That would break typical system bootup in the unclean journal case,
> normally the root FS is mounted read-only to start with (which replays
> the journal) and remounted read-write later on - and usually the fsck
> utilities are located on the root filesystem..

Couldn't that be handled by e.g. OpenRC during boot, by passing
something like --force-journal-replay during a "normal" boot?
Yes, that would mean e2fsprogs would become incompatible with older
versions, but why not "fix" the logic?

> 
>> force mount RO while not touching the journal for disk forensics. I
>> think that would also prevent the cases when a LiveCD/rescue 
>> distribution would not mount+replay it automagically but user would
>> really have to provide the switch to the command. I am really not
>> using the recovery boot cd to touch my partitions in some cases
>> unwillingly.
>
> I agree, there should be a way to force it to mount "really read only"
> so it doesn't try to replay the journal. That might require just
> ignoring the journal content, which may result in the FS appearing
> corrupt, but for recovery/forensics purposes that seems better than
> nothing..

Fully agree.

M.



* Re: document ext3 requirements
  2009-01-03 23:58             ` Robert Hancock
  2009-01-04  0:08               ` Martin MOKREJŠ
@ 2009-01-04 21:49               ` Ingo Oeser
  1 sibling, 0 replies; 67+ messages in thread
From: Ingo Oeser @ 2009-01-04 21:49 UTC (permalink / raw)
  To: Robert Hancock
  Cc: Martin MOKREJŠ, Duane Griffin, Pavel Machek, kernel list,
	Andrew Morton, tytso, mtk.manpages, rdunlap, linux-doc

On Sunday 04 January 2009, Robert Hancock wrote:
> I agree, there should be a way to force it to mount "really read only" 
> so it doesn't try to replay the journal. That might require just 
> ignoring the journal content, which may result in the FS appearing 
> corrupt, but for recovery/forensics purposes that seems better than 
> nothing..

For forensics you ALWAYS get a copy of the full disk first, 
which you set read only with blockdev --setro /dev/$MYDISK.

You then restore from this copy.


Best Regards

Ingo Oeser, been there, done that
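
The copy-first workflow described above can be sketched as follows: take a full copy of the (read-only) source and record a checksum, so that later work on the copy can be verified against it. This is only an illustration under stated assumptions. The function names and the choice of SHA-256 are hypothetical and not something the thread specifies; marking the source read-only with blockdev --setro happens outside this code.

```python
import hashlib


def copy_and_hash(src_path: str, dst_path: str, chunk_size: int = 1 << 20) -> str:
    """Copy src to dst in chunks and return the SHA-256 of the bytes copied.

    The source is opened read-only; nothing is ever written to it.
    """
    digest = hashlib.sha256()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            digest.update(chunk)
    return digest.hexdigest()


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 of a file, used to re-verify the copy later."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A typical use would be to call copy_and_hash() once on the source, keep the returned digest, and compare it with file_sha256() of the copy before each analysis session.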


* Re: document ext3 requirements
  2009-01-03 23:50           ` Martin MOKREJŠ
  2009-01-03 23:58             ` Robert Hancock
@ 2009-01-04  0:00             ` Duane Griffin
  2009-01-04  0:11               ` Martin MOKREJŠ
  1 sibling, 1 reply; 67+ messages in thread
From: Duane Griffin @ 2009-01-04  0:00 UTC (permalink / raw)
  To: Martin MOKREJŠ
  Cc: Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc

2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
> Why does not "mount -ro" die when it would have to replay the journal
> with a message that user must run fsck.ext3 in order to be able to mount
> it albeit read-only? Still I would prefer having an extra switch to
> force mount RO while not touching the journal for disk forensics.
> I think that would also prevent the cases when a LiveCD/rescue distribution
> would not mount+replay it automagically but user would really have to
> provide the switch to the command. I am really not using the recovery
> boot cd to touch my partitions in some cases unwillingly.

Well, that would make things rather tricky. As in, shutting down
uncleanly would render your system unbootable.

> Sure that does not prevent my case when I let ext2 IFS writing onto
> my ext3 partition. Actually, couldn't the driver at least warn me
> the journal log is non-empty (am just a user, sorry, cannot check
> myself the code at www.fs-driver.org if it could do at least this
> although it does not understand ext3). ;-)

The driver certainly should warn you in that case. I have no idea
whether it does, as I don't use it, sorry.

Cheers,
Duane.

-- 
"I never could learn to drink that blood and call it wine" - Bob Dylan


* Re: document ext3 requirements
  2009-01-04  0:00             ` Duane Griffin
@ 2009-01-04  0:11               ` Martin MOKREJŠ
  2009-01-04  0:41                 ` Duane Griffin
  0 siblings, 1 reply; 67+ messages in thread
From: Martin MOKREJŠ @ 2009-01-04  0:11 UTC (permalink / raw)
  To: Duane Griffin
  Cc: Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc

Duane Griffin wrote:
> 2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
>> Why does not "mount -ro" die when it would have to replay the journal
>> with a message that user must run fsck.ext3 in order to be able to mount
>> it albeit read-only? Still I would prefer having an extra switch to
>> force mount RO while not touching the journal for disk forensics.
>> I think that would also prevent the cases when a LiveCD/rescue distribution
>> would not mount+replay it automagically but user would really have to
>> provide the switch to the command. I am really not using the recovery
>> boot cd to touch my partitions in some cases unwillingly.
> 
> Well, that would make things rather tricky. As in, shutting down
> uncleanly would render your system unbootable.

??? If I am booted off a CD/DVD drive I just do not want my system
to be touched. I am fine if the dist mounts my drives automagically
in read-only mode but if that currently forces journal replay then no,
thanks. ;)

M.



* Re: document ext3 requirements
  2009-01-04  0:11               ` Martin MOKREJŠ
@ 2009-01-04  0:41                 ` Duane Griffin
  2009-01-04  3:52                   ` Valdis.Kletnieks
  0 siblings, 1 reply; 67+ messages in thread
From: Duane Griffin @ 2009-01-04  0:41 UTC (permalink / raw)
  To: Martin MOKREJŠ
  Cc: Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc

2009/1/4 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
> Duane Griffin wrote:
>> 2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
>>> Why does not "mount -ro" die when it would have to replay the journal
>>> with a message that user must run fsck.ext3 in order to be able to mount
>>> it albeit read-only? Still I would prefer having an extra switch to
>>> force mount RO while not touching the journal for disk forensics.
>>> I think that would also prevent the cases when a LiveCD/rescue distribution
>>> would not mount+replay it automagically but user would really have to
>>> provide the switch to the command. I am really not using the recovery
>>> boot cd to touch my partitions in some cases unwillingly.
>>
>> Well, that would make things rather tricky. As in, shutting down
>> uncleanly would render your system unbootable.
>
> ??? If I am booted off a CD/DVD drive I just do not want my system
> to be touched. I am fine if the dist mounts my drives automagically
> in read-only mode but if that currently forces journal replay then no,
> thanks. ;)

I agree, it isn't a great situation. Nonetheless, it has always been
thus for ext3, and so far we've muddled along. Unless and until we can
replay the journal in-memory without touching the on-disk data, we are
stuck with it.

We can't refuse to mount an unclean FS, as that would break booting.
We also can't ignore the journal by default, if/when we get a patch to
do so at all, as that effectively corrupts random chunks of the FS.
Fine for forensics and recovery; not so much for booting from.

> M.

Cheers,
Duane.

-- 
"I never could learn to drink that blood and call it wine" - Bob Dylan


* Re: document ext3 requirements
  2009-01-04  0:41                 ` Duane Griffin
@ 2009-01-04  3:52                   ` Valdis.Kletnieks
  2009-01-04 14:24                     ` Duane Griffin
  0 siblings, 1 reply; 67+ messages in thread
From: Valdis.Kletnieks @ 2009-01-04  3:52 UTC (permalink / raw)
  To: Duane Griffin
  Cc: Martin MOKREJŠ, Pavel Machek, kernel list, Andrew Morton,
	tytso, mtk.manpages, rdunlap, linux-doc

[-- Attachment #1: Type: text/plain, Size: 665 bytes --]

On Sun, 04 Jan 2009 00:41:51 GMT, Duane Griffin said:

> I agree, it isn't a great situation. Nonetheless, it has always been
> thus for ext3, and so far we've muddled along. Unless and until we can
> replay the journal in-memory without touching the on-disk data, we are
> stuck with it.

Is there a way using md/dm/lvm etc to make the source partition R/O and
replay the journal onto a CoW snapshot?  Admittedly, not easy to do inside
the 'mount' command itself, but at least it might be workable for LiveCD R/O
mounts and forensics work, where you can *tell* beforehand that's what you
want and can jump through setup games before doing the mount...

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]


* Re: document ext3 requirements
  2009-01-04  3:52                   ` Valdis.Kletnieks
@ 2009-01-04 14:24                     ` Duane Griffin
  2009-01-04 18:40                       ` Theodore Tso
  0 siblings, 1 reply; 67+ messages in thread
From: Duane Griffin @ 2009-01-04 14:24 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Martin MOKREJŠ, Pavel Machek, kernel list, Andrew Morton,
	tytso, mtk.manpages, rdunlap, linux-doc

2009/1/4  <Valdis.Kletnieks@vt.edu>:
> On Sun, 04 Jan 2009 00:41:51 GMT, Duane Griffin said:
>
>> I agree, it isn't a great situation. Nonetheless, it has always been
>> thus for ext3, and so far we've muddled along. Unless and until we can
>> replay the journal in-memory without touching the on-disk data, we are
>> stuck with it.
>
> Is there a way using md/dm/lvm etc to make the source partition R/O and
> replay the journal onto a CoW snapshot?  Admittedly, not easy to do inside
> the 'mount' command itself, but at least it might be workable for LiveCD R/O
> mounts and forensics work, where you can *tell* beforehand that's what you
> want and can jump through setup games before doing the mount...

Yes, something like that is best practice, as I understand it. The
LiveCD init scripts could check whether they are about to R/O mount an
ext[34] filesystem needing recovery and either refuse with a useful
message to the user, or even automatically create and mount a COW
snapshot, as you described. They'd still need to warn the user though,
since things like remounting R/W wouldn't work as expected.

Cheers,
Duane.

-- 
"I never could learn to drink that blood and call it wine" - Bob Dylan
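
The check suggested above for LiveCD init scripts (is this ext[34] filesystem about to need journal recovery?) can be sketched by looking at two superblock feature bits. This is a rough sketch, assuming the standard on-disk layout: s_feature_compat at offset 92 with the has_journal bit 0x0004, and s_feature_incompat at offset 96 with the needs_recovery bit 0x0004. The function names and the returned strings are hypothetical.

```python
import struct

SUPERBLOCK_OFFSET = 1024
EXT_MAGIC = 0xEF53
FEATURE_COMPAT_HAS_JOURNAL = 0x0004   # bit in s_feature_compat
FEATURE_INCOMPAT_RECOVER = 0x0004     # bit in s_feature_incompat ("needs_recovery")


def needs_journal_replay(image: bytes) -> bool:
    """True if the superblock says the journal must be replayed on mount."""
    sb = image[SUPERBLOCK_OFFSET:SUPERBLOCK_OFFSET + 1024]
    if len(sb) < 100:
        raise ValueError("image too short to contain a superblock")
    (magic,) = struct.unpack_from("<H", sb, 56)
    if magic != EXT_MAGIC:
        raise ValueError("not an ext2/ext3/ext4 superblock")
    compat, incompat = struct.unpack_from("<II", sb, 92)
    return bool(compat & FEATURE_COMPAT_HAS_JOURNAL) and bool(
        incompat & FEATURE_INCOMPAT_RECOVER
    )


def livecd_mount_decision(image: bytes) -> str:
    """Hypothetical policy: never let a plain read-only mount replay silently."""
    if needs_journal_replay(image):
        return "refuse-or-snapshot"   # warn the user, or mount a CoW snapshot
    return "mount-ro"
```

An init script would read the first 2048 bytes of the partition, call livecd_mount_decision(), and only proceed with an ordinary read-only mount when it returns "mount-ro".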


* Re: document ext3 requirements
  2009-01-04 14:24                     ` Duane Griffin
@ 2009-01-04 18:40                       ` Theodore Tso
  2009-01-04 19:21                         ` Geert Uytterhoeven
  0 siblings, 1 reply; 67+ messages in thread
From: Theodore Tso @ 2009-01-04 18:40 UTC (permalink / raw)
  To: Duane Griffin
  Cc: Valdis.Kletnieks, Martin MOKREJŠ, Pavel Machek, kernel list,
	Andrew Morton, mtk.manpages, rdunlap, linux-doc

On Sun, Jan 04, 2009 at 02:24:43PM +0000, Duane Griffin wrote:
> > Is there a way using md/dm/lvm etc to make the source partition R/O and
> > replay the journal onto a CoW snapshot?  Admittedly, not easy to do inside
> > the 'mount' command itself, but at least it might be workable for LiveCD R/O
> > mounts and forensics work, where you can *tell* beforehand that's what you
> > want and can jump through setup games before doing the mount...
> 
> Yes, something like that is best practice, as I understand it. The
> LiveCD init scripts could check whether they are about to R/O mount an
> ext[34] filesystem needing recovery and either refuse with a useful
> message to the user, or even automatically create and mount a COW
> snapshot, as you described. They'd still need to warn the user though,
> since things like remounting R/W wouldn't work as expected.

So what's the use case where people want to be able to mount a
filesystem needing recovery read/only without running the journal?

	   	   	    	      	      	      	  - Ted


* Re: document ext3 requirements
  2009-01-04 18:40                       ` Theodore Tso
@ 2009-01-04 19:21                         ` Geert Uytterhoeven
  2009-01-04 19:36                           ` Theodore Tso
                                             ` (2 more replies)
  0 siblings, 3 replies; 67+ messages in thread
From: Geert Uytterhoeven @ 2009-01-04 19:21 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Duane Griffin, Valdis.Kletnieks, Martin MOKREJŠ,
	Pavel Machek, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc

On Sun, 4 Jan 2009, Theodore Tso wrote:
> On Sun, Jan 04, 2009 at 02:24:43PM +0000, Duane Griffin wrote:
> > > Is there a way using md/dm/lvm etc to make the source partition R/O and
> > > replay the journal onto a CoW snapshot?  Admittedly, not easy to do inside
> > > the 'mount' command itself, but at least it might be workable for LiveCD R/O
> > > mounts and forensics work, where you can *tell* beforehand that's what you
> > > want and can jump through setup games before doing the mount...
> > 
> > Yes, something like that is best practice, as I understand it. The
> > LiveCD init scripts could check whether they are about to R/O mount an
> > ext[34] filesystem needing recovery and either refuse with a useful
> > message to the user, or even automatically create and mount a COW
> > snapshot, as you described. They'd still need to warn the user though,
> > since things like remounting R/W wouldn't work as expected.
> 
> So what's the use case where people want to be able to mount a
> filesystem needing recovery read/only without running the journal?

As mentioned before, suspending a laptop (running from hdd), running a live CD,
and expecting everything to work fine when resuming from hdd?

I think most people are shocked when they discover that mounting something
read-only may actually write to the media. This is a bit unexpected (hey, if I
mount `read-only', I expect that no writes will happen), as it behaved
differently before the introduction of journalling.

As for mounting the root file system read-only during early boot up, and
remounting it read-write later, I guess it's quite complicated to replay the
journal (in RAM) on a read-only mount, and defer the replay writeback until
remounting read-write?

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds

* Re: document ext3 requirements
  2009-01-04 19:21                         ` Geert Uytterhoeven
@ 2009-01-04 19:36                           ` Theodore Tso
  2009-01-04 19:51                             ` Duane Griffin
  2009-01-04 22:42                           ` Bron Gondwana
  2009-01-05  3:22                           ` Rob Landley
  2 siblings, 1 reply; 67+ messages in thread
From: Theodore Tso @ 2009-01-04 19:36 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Duane Griffin, Valdis.Kletnieks, Martin MOKREJŠ,
	Pavel Machek, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc

On Sun, Jan 04, 2009 at 08:21:06PM +0100, Geert Uytterhoeven wrote:
> As mentioned before, suspending a laptop (running from hdd), running
> a live CD, and expecting everything to work fine when resuming from
> hdd?
> 
> I think most people get shocked when they discover that mounting
> something read-only may actually write to the media. This is a bit
> unexpected (hey, if I mount `read-only', I expect that no writes
> will happen), as it behaved differently before the introduction of
> journalling.

It's been this way for about a decade....  that being said, if you
really want to do this, you can today via "mount -o ro,noload /dev/XXX
/mntpt".  However, the system could crash or fail, because a
filesystem mounted without replaying the journal could be quite inconsistent.
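
For a LiveCD-style init script, the choice between a plain "ro" mount
and "ro,noload" could be sketched roughly as below. This is only an
illustration: pick_ro_opts is a made-up helper name, and it assumes the
feature list comes from "debugfs -R features /dev/XXX", where a journal
that still needs replay shows up as needs_recovery.

```shell
# Sketch: choose mount options for a "no writes at all" read-only mount.
# $1 is the feature list as printed by "debugfs -R features <dev>".
# pick_ro_opts is a hypothetical helper, not part of mount or e2fsprogs.
pick_ro_opts() {
    case " $1 " in
        *" needs_recovery "*)
            # Journal not yet replayed: a plain "ro" mount would write to
            # the device, so skip loading the journal instead.
            echo "warning: journal not replayed, filesystem may be inconsistent" >&2
            echo "ro,noload"
            ;;
        *)
            # Clean journal: a plain read-only mount has nothing to replay.
            echo "ro"
            ;;
    esac
}
```

It might be used as
mount -o "$(pick_ro_opts "$(debugfs -R features /dev/XXX 2>/dev/null)")" /dev/XXX /mntpt
with the same caveat about inconsistency whenever noload is chosen.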

> As for mounting the root file system read-only during early boot up, and
> remounting it read-write later, I guess it's quite complicated to replay the
> journal (in RAM) on read-only mount, and defer the replay writeback until
> remounting read-write?

It's not *that* hard; if someone would like to cons up a patch, please
feel free....  but it's certainly not a high priority for me or most
of the other ext3 filesystem developers.

					- Ted

* Re: document ext3 requirements
  2009-01-04 19:36                           ` Theodore Tso
@ 2009-01-04 19:51                             ` Duane Griffin
  2009-01-04 21:55                               ` Theodore Tso
  0 siblings, 1 reply; 67+ messages in thread
From: Duane Griffin @ 2009-01-04 19:51 UTC (permalink / raw)
  To: Theodore Tso, Geert Uytterhoeven, Duane Griffin, Valdis.Kletnieks,
	Martin MOKREJŠ, Pavel Machek, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc

2009/1/4 Theodore Tso <tytso@mit.edu>:
> On Sun, Jan 04, 2009 at 08:21:06PM +0100, Geert Uytterhoeven wrote:
>> As for mounting the root file system read-only during early boot up, and
>> remounting it read-write later, I guess it's quite complicated to replay the
>> journal (in RAM) on read-only mount, and defer the replay writeback until
>> remounting read-write?
>
> It's not *that* hard; if someone would like to cons up a patch, please
> feel free....  but it's certainly not a high priority for me or most
> of the other ext3 filesystem developers.

If anyone is interested I'd be happy to dust off and send them my old
patches to implement this. There are a couple of issues with it.
First, I never got around to implementing remount R/W support. Second,
I had to introduce a rather nasty hack in order to handle un-escaping
JFS magic numbers.

>                                        - Ted

Cheers,
Duane.

-- 
"I never could learn to drink that blood and call it wine" - Bob Dylan

* Re: document ext3 requirements
  2009-01-04 19:51                             ` Duane Griffin
@ 2009-01-04 21:55                               ` Theodore Tso
  2009-01-04 22:06                                 ` Duane Griffin
  0 siblings, 1 reply; 67+ messages in thread
From: Theodore Tso @ 2009-01-04 21:55 UTC (permalink / raw)
  To: Duane Griffin
  Cc: Geert Uytterhoeven, Valdis.Kletnieks, Martin MOKREJŠ,
	Pavel Machek, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc

On Sun, Jan 04, 2009 at 07:51:27PM +0000, Duane Griffin wrote:
> 
> If anyone is interested I'd be happy to dust off and send them my old
> patches to implement this. There are a couple of issues with it.
> First, I never got around to implementing remount R/W support. Second,
> I had to introduce a rather nasty hack in order to handle un-escaping
> JFS magic numbers.

Can you dust off the patches and send a copy to
linux-ext4@vger.kernel.org so we have them archived someplace where
hopefully someone might have time to look at it?

	  	  	     	     	  - Ted

* Re: document ext3 requirements
  2009-01-04 21:55                               ` Theodore Tso
@ 2009-01-04 22:06                                 ` Duane Griffin
  0 siblings, 0 replies; 67+ messages in thread
From: Duane Griffin @ 2009-01-04 22:06 UTC (permalink / raw)
  To: Theodore Tso, Duane Griffin, Geert Uytterhoeven, Valdis.Kletnieks,
	Martin MOKREJŠ, Pavel Machek, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc

2009/1/4 Theodore Tso <tytso@mit.edu>:
> On Sun, Jan 04, 2009 at 07:51:27PM +0000, Duane Griffin wrote:
>>
>> If anyone is interested I'd be happy to dust off and send them my old
>> patches to implement this. There are a couple of issues with it.
>> First, I never got around to implementing remount R/W support. Second,
>> I had to introduce a rather nasty hack in order to handle un-escaping
>> JFS magic numbers.
>
> Can you dust off the patches and send a copy to
> linux-ext4@vger.kernel.org so we have them archived someplace where
> hopefully someone might have time to look at it?

OK, will do. I've posted them there before, but not the latest version
that properly handles un-escaping JFS magic numbers (albeit in an ugly
way). I'll rebase them on top of the latest ext4 patch queue and
repost.

>                                          - Ted

Cheers,
Duane.

-- 
"I never could learn to drink that blood and call it wine" - Bob Dylan

* Re: document ext3 requirements
  2009-01-04 19:21                         ` Geert Uytterhoeven
  2009-01-04 19:36                           ` Theodore Tso
@ 2009-01-04 22:42                           ` Bron Gondwana
  2009-01-05  3:22                           ` Rob Landley
  2 siblings, 0 replies; 67+ messages in thread
From: Bron Gondwana @ 2009-01-04 22:42 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Theodore Tso, Duane Griffin, Valdis.Kletnieks,
	Martin MOKREJŠ, Pavel Machek, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc

On Sun, Jan 04, 2009 at 08:21:06PM +0100, Geert Uytterhoeven wrote:
> On Sun, 4 Jan 2009, Theodore Tso wrote:
> > On Sun, Jan 04, 2009 at 02:24:43PM +0000, Duane Griffin wrote:
> > > > Is there a way using md/dm/lvm etc to make the source partition R/O and
> > > > replay the journal onto a CoW snapshot?  Admittedly, not easy to do inside
> > > > the 'mount' command itself, but at least it might be workable for LiveCD R/O
> > > > mounts and forensics work, where you can *tell* beforehand that's what you
> > > > want and can jump through setup games before doing the mount...
> > > 
> > > Yes, something like that is best practice, as I understand it. The
> > > LiveCD init scripts could check whether they are about to R/O mount an
> > > ext[34] filesystem needing recovery and either refuse with a useful
> > > message to the user, or even automatically create and mount a COW
> > > snapshot, as you described. They'd still need to warn the user though,
> > > since things like remounting R/W wouldn't work as expected.
> > 
> > So what's the use case where people want to be able to mount a
> > filesystem needing recovery read/only without running the journal?
> 
> As mentioned before, suspending a laptop (running from hdd), running a live CD,
> and expecting everything to work fine when resuming from hdd?

Any particular reason why suspend doesn't run the journal during
shutdown and leave a clean filesystem?  It shouldn't take that
long surely.

I know it doesn't solve the "it really just crashed" problem, but
you don't tend to unsuspend from a crash anyway.

Bron ( just curious )

* Re: document ext3 requirements
  2009-01-04 19:21                         ` Geert Uytterhoeven
  2009-01-04 19:36                           ` Theodore Tso
  2009-01-04 22:42                           ` Bron Gondwana
@ 2009-01-05  3:22                           ` Rob Landley
  2 siblings, 0 replies; 67+ messages in thread
From: Rob Landley @ 2009-01-05  3:22 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Theodore Tso, Duane Griffin, Valdis.Kletnieks,
	Martin MOKREJŠ, Pavel Machek, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc

On Sunday 04 January 2009 13:21:06 Geert Uytterhoeven wrote:
> I think most people get shocked when they discover that mounting something
> read-only may actually write to the media. This is a bit unexpected (hey, if
> I mount `read-only', I expect that no writes will happen), as it behaved
> differently before the introduction of journalling.

Is this an unreasonable use case:

  kill -STOP $(pidof qemu)
  mount -o loop,ro hdb.img blah
  cp blah/thingy thingy
  umount blah
  kill -CONT $(pidof qemu)

Currently, if your loopback mount is -t ext3 it'll write to the block device, 
and if your mount is -t ext2 it'll refuse to work on an unclean ext3 
filesystem, even if it's read only.  (But it _will_ work on an unclean ext2 
filesystem.)

My theory when I first found out about this was "the filesystem developers 
hate me personally".

Rob

* Re: document ext3 requirements
  2009-01-03 23:01       ` Martin MOKREJŠ
  2009-01-03 23:38         ` Duane Griffin
@ 2009-01-04  0:19         ` Pavel Machek
  2009-01-05  2:55           ` Rob Landley
  2009-01-04 19:56         ` Rob Landley
  2009-01-06 10:08         ` Matthias Andree
  3 siblings, 1 reply; 67+ messages in thread
From: Pavel Machek @ 2009-01-04  0:19 UTC (permalink / raw)
  To: Martin MOKREJŠ
  Cc: Duane Griffin, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc

On Sun 2009-01-04 00:01:58, Martin MOKREJŠ wrote:
> Pavel Machek wrote:
> > On Sat 2009-01-03 22:17:15, Duane Griffin wrote:
> >> [Fixed top-posting]
> >>
> >> 2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
> >>> Pavel Machek wrote:
> >>>> readonly mount does actually write to the media in some cases. Document that.
> >>>>
> >>> Can one avoid replay of the journal then if it would be unclean?
> >>> Just curious.
> >> Nope. If the underlying block device is read-only then mounting the
> >> filesystem will fail. I tried to fix this some time ago, and have a
> >> set of patches that almost always work, but "almost always" isn't good
> >> enough. Unfortunately I never managed to figure out a way to finish it
> >> off without disgusting hacks or major surgery.
> > 
> > Uhuh, can you just ignore the journal and mount it anyway?
> > ...basically treating it like an ext2?
> > 
> > ...ok, that will present "old" version of the filesystem to the
> > user... violating fsync() semantics.
> 
> Hmm, so if my dual-boot machine does not shut down correctly and I boot
> accidentally in M$ Win where I use ext2 IFS driver and modify some
> stuff on the ext3 drive, after a while reboot to linux and the journal
> gets re-played ... Mmm ...

ext2 driver should refuse to mount dirty ext3 filesystem. (Linux ext2
driver does that).

> > Still handy for recovering badly broken filesystems, I'd say.
> 
> Me as well. How about improving your doc patch with some summary of
> this thread (although it is probably not over yet)? ;-) Definitely,
> a note that one can mount it as ext2 while read-only would be helpful
> when doing some forensics on the disk.

No, you can't mount unclean ext3 as an ext2; patch to do that would be
possible but...

I believe the patch is correct & useful.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

* Re: document ext3 requirements
  2009-01-04  0:19         ` Pavel Machek
@ 2009-01-05  2:55           ` Rob Landley
  0 siblings, 0 replies; 67+ messages in thread
From: Rob Landley @ 2009-01-05  2:55 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Martin MOKREJŠ, Duane Griffin, kernel list, Andrew Morton,
	tytso, mtk.manpages, rdunlap, linux-doc

On Saturday 03 January 2009 18:19:00 Pavel Machek wrote:
> No, you can't mount unclean ext3 as an ext2; patch to do that would be
> possible but...

tune2fs -O ^has_journal /dev/blah
fsck.ext2 -f /dev/blah

Rob

* Re: document ext3 requirements
  2009-01-03 23:01       ` Martin MOKREJŠ
  2009-01-03 23:38         ` Duane Griffin
  2009-01-04  0:19         ` Pavel Machek
@ 2009-01-04 19:56         ` Rob Landley
  2009-01-05 19:16           ` Theodore Tso
  2009-01-06 10:08         ` Matthias Andree
  3 siblings, 1 reply; 67+ messages in thread
From: Rob Landley @ 2009-01-04 19:56 UTC (permalink / raw)
  To: Martin MOKREJŠ
  Cc: Pavel Machek, Duane Griffin, kernel list, Andrew Morton, tytso,
	mtk.manpages, rdunlap, linux-doc

On Saturday 03 January 2009 17:01:58 Martin MOKREJŠ wrote:
> > Still handy for recovering badly broken filesystems, I'd say.
>
> Me as well. How about improving your doc patch with some summary of
> this thread (although it is probably not over yet)? ;-) Definitely,
> a note that one can mount it as ext2 while read-only would be helpful
> when doing some forensics on the disk.

Although make sure you _do_ mount it as read only because if you mount an ext3 
filesystem read/write as ext2 I've had it zap the journal entirely and then 
you have to tune2fs -j the sucker to turn it back into ext3.

Ext3 is... touchy.

Rob

* Re: document ext3 requirements
  2009-01-04 19:56         ` Rob Landley
@ 2009-01-05 19:16           ` Theodore Tso
  2009-01-06 19:20             ` Rob Landley
  0 siblings, 1 reply; 67+ messages in thread
From: Theodore Tso @ 2009-01-05 19:16 UTC (permalink / raw)
  To: Rob Landley
  Cc: Martin MOKREJŠ, Pavel Machek, Duane Griffin, kernel list,
	Andrew Morton, mtk.manpages, rdunlap, linux-doc

On Sun, Jan 04, 2009 at 01:56:32PM -0600, Rob Landley wrote:
> On Saturday 03 January 2009 17:01:58 Martin MOKREJŠ wrote:
> > > Still handy for recovering badly broken filesystems, I'd say.
> >
> > Me as well. How about improving your doc patch with some summary of
> > this thread (although it is probably not over yet)? ;-) Definitely,
> > a note that one can mount it as ext2 while read-only would be helpful
> > when doing some forensics on the disk.
> 
> Although make sure you _do_ mount it as read only because if you mount an ext3 
> filesystem read/write as ext2 I've had it zap the journal entirely and then 
> you have to tune2fs -j the sucker to turn it back into ext3.
> 
> Ext3 is... touchy.

Um.... horse pucky:

# mke2fs -q -t ext3 /dev/thunk/footest
# debugfs -R features /dev/thunk/footest
debugfs 1.41.3 (12-Oct-2008)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype sparse_super large_file
# mount -t ext2 /dev/thunk/footest /mnt
# touch /mnt/foo
# umount /mnt
# debugfs -R features /dev/thunk/footest
debugfs 1.41.3 (12-Oct-2008)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype sparse_super large_file

   	     		 	  	       		 - Ted

* Re: document ext3 requirements
  2009-01-05 19:16           ` Theodore Tso
@ 2009-01-06 19:20             ` Rob Landley
  0 siblings, 0 replies; 67+ messages in thread
From: Rob Landley @ 2009-01-06 19:20 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Martin MOKREJŠ, Pavel Machek, Duane Griffin, kernel list,
	Andrew Morton, mtk.manpages, rdunlap, linux-doc

On Monday 05 January 2009 13:16:58 Theodore Tso wrote:
> On Sun, Jan 04, 2009 at 01:56:32PM -0600, Rob Landley wrote:
> > On Saturday 03 January 2009 17:01:58 Martin MOKREJŠ wrote:
> > > > Still handy for recovering badly broken filesystems, I'd say.
> > >
> > > Me as well. How about improving your doc patch with some summary of
> > > this thread (although it is probably not over yet)? ;-) Definitely,
> > > a note that one can mount it as ext2 while read-only would be helpful
> > > when doing some forensics on the disk.
> >
> > Although make sure you _do_ mount it as read only because if you mount an
> > ext3 filesystem read/write as ext2 I've had it zap the journal entirely
> > and then you have to tune2fs -j the sucker to turn it back into ext3.
> >
> > Ext3 is... touchy.
>
> Um.... horse pucky:

Well I managed to kill it more than once, but I could easily have the 
reproduction sequence wrong.  (I wasn't _trying_ to do it again...)

> # mke2fs -q -t ext3 /dev/thunk/footest
> # debugfs -R features /dev/thunk/footest
> debugfs 1.41.3 (12-Oct-2008)
> Filesystem features: has_journal ext_attr resize_inode dir_index filetype sparse_super large_file
> # mount -t ext2 /dev/thunk/footest /mnt
> # touch /mnt/foo
> # umount /mnt
> # debugfs -R features /dev/thunk/footest
> debugfs 1.41.3 (12-Oct-2008)
> Filesystem features: has_journal ext_attr resize_inode dir_index filetype sparse_super large_file

If I can figure out what I did, I'll get back to you.

>    	     		 	  	       		 - Ted

Rob

* Re: document ext3 requirements
  2009-01-03 23:01       ` Martin MOKREJŠ
                           ` (2 preceding siblings ...)
  2009-01-04 19:56         ` Rob Landley
@ 2009-01-06 10:08         ` Matthias Andree
  2009-01-06 15:23           ` Theodore Tso
  3 siblings, 1 reply; 67+ messages in thread
From: Matthias Andree @ 2009-01-06 10:08 UTC (permalink / raw)
  To: Martin MOKREJŠ; +Cc: Duane Griffin, kernel list

On Sun, 04 Jan 2009, Martin MOKREJŠ wrote:

> Pavel Machek wrote:
> > On Sat 2009-01-03 22:17:15, Duane Griffin wrote:
> >> [Fixed top-posting]
> >>
> >> 2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
> >>> Pavel Machek wrote:
> >>>> readonly mount does actually write to the media in some cases. Document that.
> >>>>
> >>> Can one avoid replay of the journal then if it would be unclean?
> >>> Just curious.
> >> Nope. If the underlying block device is read-only then mounting the
> >> filesystem will fail. I tried to fix this some time ago, and have a
> >> set of patches that almost always work, but "almost always" isn't good
> >> enough. Unfortunately I never managed to figure out a way to finish it
> >> off without disgusting hacks or major surgery.
> > 
> > Uhuh, can you just ignore the journal and mount it anyway?
> > ...basically treating it like an ext2?
> > 
> > ...ok, that will present "old" version of the filesystem to the
> > user... violating fsync() semantics.
> 
> > Hmm, so if my dual-boot machine does not shut down correctly and I boot
> > accidentally in M$ Win where I use ext2 IFS driver and modify some
> > stuff on the ext3 drive, after a while reboot to linux and the journal
> > gets re-played ... Mmm ...

If the ext2 IFS driver mounts an ext3 file system that needs journal
replay, the IFS driver is broken (unless it can replay the journal, of
course - I stopped using that driver long ago, being unhappy with it).

-- 
Matthias Andree

* Re: document ext3 requirements
  2009-01-06 10:08         ` Matthias Andree
@ 2009-01-06 15:23           ` Theodore Tso
  0 siblings, 0 replies; 67+ messages in thread
From: Theodore Tso @ 2009-01-06 15:23 UTC (permalink / raw)
  To: Martin MOKREJŠ, Duane Griffin, kernel list

On Tue, Jan 06, 2009 at 11:08:10AM +0100, Matthias Andree wrote:
> On Sun, 04 Jan 2009, Martin MOKREJŠ wrote:
> > Hmm, so if my dual-boot machine does not shut down correctly and I boot
> > accidentally in M$ Win where I use ext2 IFS driver and modify some
> > stuff on the ext3 drive, after a while reboot to linux and the journal
> > gets re-played ... Mmm ...
> 
> If the ext2 IFS driver mounts an ext3 file system that needs journal
> replay, the IFS driver is broken (unless it can replay the journal, of
> course - I stopped using that driver long ago, being unhappy with it).

Indeed; that's why there is an INCOMPAT NEEDS_RECOVERY feature flag to
prevent compliant ext2 implementations from mounting an ext3
filesystem that needs recovery.  We've thought about most of these
issues, almost a decade ago...
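
As a rough illustration of where that flag lives, the sketch below
reads it straight from an image file. It assumes the usual ext2
superblock layout (superblock at byte 1024, s_feature_incompat at
offset 96 within it, recovery bit 0x0004); has_recover_flag is a
made-up helper name, and the magic number is not checked.

```shell
# Sketch: succeed if an ext2/ext3 image has the "needs recovery"
# incompat bit set. Byte 1120 = 1024 (superblock start) + 96
# (s_feature_incompat, stored little-endian).
# has_recover_flag is a hypothetical helper, not an e2fsprogs tool.
has_recover_flag() {
    img=$1
    # The 0x0004 bit sits in the lowest byte of the field, so reading a
    # single byte avoids any dependence on host endianness.
    low=$(dd if="$img" bs=1 skip=1120 count=1 2>/dev/null | od -An -tu1 | tr -d ' \n')
    # An empty result means the file was too short to hold a superblock.
    [ -n "$low" ] && [ $(( low & 4 )) -ne 0 ]
}
```

For example, has_recover_flag hdb.img && echo "journal needs replay".
The check only reads from the image.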

							- Ted

* Re: document ext3 requirements
  2009-01-03 22:29     ` Pavel Machek
  2009-01-03 23:01       ` Martin MOKREJŠ
@ 2009-01-03 23:12       ` Duane Griffin
  2009-01-06 10:06       ` Matthias Andree
  2 siblings, 0 replies; 67+ messages in thread
From: Duane Griffin @ 2009-01-03 23:12 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Martin MOKREJŠ, kernel list, Andrew Morton, tytso,
	mtk.manpages, rdunlap, linux-doc

2009/1/3 Pavel Machek <pavel@suse.cz>:
> On Sat 2009-01-03 22:17:15, Duane Griffin wrote:
>> [Fixed top-posting]
>>
>> 2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
>> > Pavel Machek wrote:
>> >> readonly mount does actually write to the media in some cases. Document that.
>> >>
>> > Can one avoid replay of the journal then if it would be unclean?
>> > Just curious.
>>
>> Nope. If the underlying block device is read-only then mounting the
>> filesystem will fail. I tried to fix this some time ago, and have a
>> set of patches that almost always work, but "almost always" isn't good
>> enough. Unfortunately I never managed to figure out a way to finish it
>> off without disgusting hacks or major surgery.
>
> Uhuh, can you just ignore the journal and mount it anyway?
> ...basically treating it like an ext2?

I'm afraid not, ext2 won't mount an FS with EXT3_FEATURE_INCOMPAT_RECOVER set.

> ...ok, that will present "old" version of the filesystem to the
> user... violating fsync() semantics.
>
> Still handy for recovering badly broken filesystems, I'd say.
>
>                                                                        Pavel

Cheers,
Duane.

-- 
"I never could learn to drink that blood and call it wine" - Bob Dylan

* Re: document ext3 requirements
  2009-01-03 22:29     ` Pavel Machek
  2009-01-03 23:01       ` Martin MOKREJŠ
  2009-01-03 23:12       ` Duane Griffin
@ 2009-01-06 10:06       ` Matthias Andree
  2 siblings, 0 replies; 67+ messages in thread
From: Matthias Andree @ 2009-01-06 10:06 UTC (permalink / raw)
  To: kernel list

On Sat, 03 Jan 2009, Pavel Machek wrote:

> On Sat 2009-01-03 22:17:15, Duane Griffin wrote:
> > [Fixed top-posting]
> > 
> > 2009/1/3 Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>:
> > > Pavel Machek wrote:
> > >> readonly mount does actually write to the media in some cases. Document that.
> > >>
> > > Can one avoid replay of the journal then if it would be unclean?
> > > Just curious.
> > 
> > Nope. If the underlying block device is read-only then mounting the
> > filesystem will fail. I tried to fix this some time ago, and have a
> > set of patches that almost always work, but "almost always" isn't good
> > enough. Unfortunately I never managed to figure out a way to finish it
> > off without disgusting hacks or major surgery.
> 
> Uhuh, can you just ignore the journal and mount it anyway?

An ext3 file system that needs journal recovery sets one of the ext2
incompatible flags to prevent just that.

> ...basically treating it like an ext2?
> 
> ...ok, that will present "old" version of the filesystem to the
> user... violating fsync() semantics.
> 
> Still handy for recovering badly broken filesystems, I'd say.

While you cannot have that, you'll need to dump the file system
(possibly with dd_rescue) to another medium and work on the copy.
That's what you should do anyways. ;-)

I think if you really want to mount the file system without journal
replay, you need to clear the needs-recovery "incompat" flag (on the
copy, obviously).

-- 
Matthias Andree

* Re: document ext3 requirements
  2009-01-03 12:38 document ext3 requirements Pavel Machek
  2009-01-03 21:17 ` Martin MOKREJŠ
@ 2009-01-04  2:32 ` Theodore Tso
  2009-01-04 22:33   ` Pavel Machek
  2009-01-04 22:34   ` [patch] document ext3 a bit better Pavel Machek
  2009-01-04 13:35 ` document ext3 requirements Alexander E. Patrakov
  2009-01-04 19:49 ` Rob Landley
  3 siblings, 2 replies; 67+ messages in thread
From: Theodore Tso @ 2009-01-04  2:32 UTC (permalink / raw)
  To: Pavel Machek; +Cc: kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc

On Sat, Jan 03, 2009 at 01:38:15PM +0100, Pavel Machek wrote:
> +Requirements
> +============
> +
> +Ext3 expects disk/storage subsystem to behave sanely. On sanely
> +behaving disk subsystem, data that have been successfully synced will
> +stay on the disk. Sane means:
> +
> +* writes to media never fail. Even if disk returns error condition during
> +  write, ext3 can't handle that correctly, because success on fsync was already
> +  returned when data hit the journal.
> +
> +	   (Fortunately writes failing are very uncommon on disks, as they
> +	   have spare sectors they use when write fails.)

This is not unique to ext3; per the discussion two weeks ago, this is
largely because of the fsync() interface not possibly being able to
return errors caused by failures when creating or modifying parent
directories.  Given this, it's a bit misleading to place this in the
Documentation/filesystems/ext3.txt.  At the minimum it should include
a discussion about what the issues might be, and given that pretty
much any Unix/Linux filesystem doesn't have a way of reflecting these
errors to application programs, it probably should be in a
filesystem-independent documentation file.

> +* either whole sector is correctly written or nothing is written during
> +  powerfail.
> +
> +	   (Unfortunately, none of the cheap USB/SD flash cards I have seen behave
> +	   like this, and are unsuitable for ext3. Because RAM tends to fail
> +	   faster than rest of system during powerfail, special hw killing
> +	   DMA transfers may be necessary. Not sure how common that problem
> +	   is on generic PC machines).

Again, this is true for other filesystems (it was first discovered on
SGI "pizza boxes" machines running XFS, and special hardware changes
added to allow DMA aborts) --- in fact, because of ext3's use of
physical block journaling, it's much more likely that it will recover
from these sorts of errors.  So it's very misleading to have this sort
of discussion in Documentation/filesystems/ext3.txt.

> +* either write caching is disabled, or hw can do barriers and they are enabled.
> +
> +	   (Note that barriers are disabled by default, use "barrier=1"
> +	   mount option after making sure hw can support them). 

We really should get akpm to agree to accept the patch to enable
barriers by default instead.  :-)

							- Ted

* Re: document ext3 requirements
  2009-01-04  2:32 ` Theodore Tso
@ 2009-01-04 22:33   ` Pavel Machek
  2009-01-04 22:34   ` [patch] document ext3 a bit better Pavel Machek
  1 sibling, 0 replies; 67+ messages in thread
From: Pavel Machek @ 2009-01-04 22:33 UTC (permalink / raw)
  To: Theodore Tso, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc

Hi!

On Sat 2009-01-03 21:32:11, Theodore Tso wrote:
> On Sat, Jan 03, 2009 at 01:38:15PM +0100, Pavel Machek wrote:
> > +Requirements
> > +============
> > +
> > +Ext3 expects disk/storage subsystem to behave sanely. On sanely
> > +behaving disk subsystem, data that have been successfully synced will
> > +stay on the disk. Sane means:
> > +
> > +* writes to media never fail. Even if disk returns error condition during
> > +  write, ext3 can't handle that correctly, because success on fsync was already
> > +  returned when data hit the journal.
> > +
> > +	   (Fortunately writes failing are very uncommon on disks, as they
> > +	   have spare sectors they use when write fails.)
> 
> This is not unique to ext3; per the discussion two weeks ago, this is
> largely because of the fsync() interface not possibly being able to

Ok, so I guess I should split the patch into a truly ext3-specific part
and a part that is common to all the filesystems. I guess I'll need
some help with everything but ext2 and ext3...

> return errors caused by failures when creating or modifying parent
> directories.  Given this, it's a bit misleading to place this in the
> Documentation/filesystems/ext3.txt.  At the minimum it should include
> a discussion about what the issues might be, and given that pretty
> much any Unix/Linux filesystem doesn't have a way of reflecting these
> errors to application programs, it probably should be in a
> filesystem-independent documentation file.

Ok. I'll have to think about a good name for that file.

> > +* either write caching is disabled, or hw can do barriers and they are enabled.
> > +
> > +	   (Note that barriers are disabled by default, use "barrier=1"
> > +	   mount option after making sure hw can support them). 
> 
> > We really should get akpm to agree to accept the patch to enable
> > barriers by default instead.  :-)

:-). Yes, that would help a bit.

(No, it is not a complete solution. barrier=0 with write caching on
should still be documented as unsafe).
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

* [patch] document ext3 a bit better
  2009-01-04  2:32 ` Theodore Tso
  2009-01-04 22:33   ` Pavel Machek
@ 2009-01-04 22:34   ` Pavel Machek
  2009-01-05 14:57     ` Theodore Tso
  1 sibling, 1 reply; 67+ messages in thread
From: Pavel Machek @ 2009-01-04 22:34 UTC (permalink / raw)
  To: Theodore Tso, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc


ext3 has quite unexpected semantics for "ro", and the defaults are usually
not what they are documented to be, due to the mkfs override.

Signed-off-by: Pavel Machek <pavel@suse.cz>

diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index 9dd2a3b..113db1f 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -14,6 +14,11 @@ Options
 When mounting an ext3 filesystem, the following option are accepted:
 (*) == default
 
+ro			Mount filesystem read only. Note that ext3 will replay
+			the journal (and thus write to the partition) even when
+			mounted "read only". "ro, noload" can be used to prevent
+			writes to the filesystem.
+
 journal=update		Update the ext3 file system's journal to the current
 			format.
 
@@ -27,7 +32,9 @@ journal_dev=devnum	When the external jou
 			identified through its new major/minor numbers encoded
 			in devnum.
 
-noload			Don't load the journal on mounting.
+noload			Don't load the journal on mounting. Note that this forces
+			mount of inconsistent filesystem, which can lead to
+			various problems.
 
 data=journal		All data are committed into the journal prior to being
 			written into the main file system.
@@ -95,6 +102,8 @@ debug			Extra debugging information is s
 errors=remount-ro(*)	Remount the filesystem read-only on an error.
 errors=continue		Keep going on a filesystem error.
 errors=panic		Panic and halt the machine if an error occurs.
+			(Note that default is overridden by superblock
+			setting on most systems).
 
 data_err=ignore(*)	Just print an error message if an error occurs
 			in a file data buffer in ordered mode.

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: [patch] document ext3 a bit better
  2009-01-04 22:34   ` [patch] document ext3 a bit better Pavel Machek
@ 2009-01-05 14:57     ` Theodore Tso
  2009-01-06  9:21       ` Pavel Machek
  0 siblings, 1 reply; 67+ messages in thread
From: Theodore Tso @ 2009-01-05 14:57 UTC (permalink / raw)
  To: Pavel Machek; +Cc: kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc

On Sun, Jan 04, 2009 at 11:34:33PM +0100, Pavel Machek wrote:
> @@ -14,6 +14,11 @@ Options
>  When mounting an ext3 filesystem, the following option are accepted:
>  (*) == default
>  
> +ro			Mount filesystem read only. Note that ext3 will replay
> +			the journal (and thus write to the partition) even when
> +			mounted "read only". "ro, noload" can be used to prevent
> +			writes to the filesystem.

I'd suggest "ro,noload" since the spaces screw up the mount options
parsing both on the command-line and in /etc/fstab.  So how about:

	Using the mount options "ro,noload" can be used....
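
For illustration, the no-spaces form could be used as in the following
minimal sketch. The device /dev/sdXN and the mount point /mnt are
placeholders, and the helper names are hypothetical; the script only prints
the mount command rather than running it.

```shell
#!/bin/sh
# Sketch only: /dev/sdXN and /mnt are placeholders, helper names are made up.

# Succeed only if the option string is a single comma-separated word;
# an embedded space breaks option parsing on the command line and in fstab.
opts_ok() {
    case "$1" in
        *' '*) return 1 ;;
        *)     return 0 ;;
    esac
}

# Print (do not run) a mount command that is read-only and skips journal replay.
ro_noload_cmd() {
    opts="ro,noload"
    opts_ok "$opts" || return 1
    printf 'mount -t ext3 -o %s %s %s\n' "$opts" "$1" "$2"
}

ro_noload_cmd /dev/sdXN /mnt
```

Note that skipping journal replay means the filesystem may be mounted in an
inconsistent state, as the noload text in the patch says.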

> @@ -95,6 +102,8 @@ debug			Extra debugging information is s
>  errors=remount-ro(*)	Remount the filesystem read-only on an error.
>  errors=continue		Keep going on a filesystem error.
>  errors=panic		Panic and halt the machine if an error occurs.
> +			(Note that default is overriden by superblock
> +			setting on most systems).

The default is always specified by the superblock setting.  So users
will probably find it easier to understand if we remove the "(*)" and
add the explanatory comment:

			(These mount options override the errors behavior
			specified in the superblock, which can be configured
			using tune2fs)
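
As a rough sketch of how the superblock setting can be inspected and
changed: `tune2fs -l` prints an "Errors behavior:" line and `tune2fs -e`
sets it. The device name /dev/sdXN is a placeholder and the helper name is
hypothetical; the tune2fs and mount commands are shown as comments because
they need root and a real device.

```shell
#!/bin/sh
# Sketch only: errors_behavior is a made-up helper; /dev/sdXN is a placeholder.

# Read `tune2fs -l` output on stdin and print the superblock errors setting.
errors_behavior() {
    sed -n 's/^Errors behavior:[[:space:]]*//p'
}

# Typical use (needs root and a real device, so left commented out):
#   tune2fs -l /dev/sdXN | errors_behavior
#   tune2fs -e remount-ro /dev/sdXN        # change the superblock default
#   mount -o errors=panic /dev/sdXN /mnt   # per-mount override
```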

Pavel, thanks for working on improving the documentation; with these
fixes,

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

							- Ted


* Re: [patch] document ext3 a bit better
  2009-01-05 14:57     ` Theodore Tso
@ 2009-01-06  9:21       ` Pavel Machek
  2009-01-09 23:24         ` Jiri Kosina
  0 siblings, 1 reply; 67+ messages in thread
From: Pavel Machek @ 2009-01-06  9:21 UTC (permalink / raw)
  To: Theodore Tso, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, Trivial patch monkey

On Mon 2009-01-05 09:57:13, Theodore Tso wrote:
> On Sun, Jan 04, 2009 at 11:34:33PM +0100, Pavel Machek wrote:
> > @@ -14,6 +14,11 @@ Options
> >  When mounting an ext3 filesystem, the following option are accepted:
> >  (*) == default
> >  
> > +ro			Mount filesystem read only. Note that ext3 will replay
> > +			the journal (and thus write to the partition) even when
> > +			mounted "read only". "ro, noload" can be used to prevent
> > +			writes to the filesystem.
> 
> I'd suggest "ro,noload" since the spaces screw up the mount options
> parsing both on the command-line and in /etc/fstab.  So how about:
> 
> 	Using the mount options "ro,noload" can be used....

Too many "using", but yes, fixed, thanks.

> > @@ -95,6 +102,8 @@ debug			Extra debugging information is s
> >  errors=remount-ro(*)	Remount the filesystem read-only on an error.
> >  errors=continue		Keep going on a filesystem error.
> >  errors=panic		Panic and halt the machine if an error occurs.
> > +			(Note that default is overriden by superblock
> > +			setting on most systems).
> 
> The default is always specified by the superblock setting.  So users
> will probably find it easier to understand if we remove the "(*)" and
> add the explanatory comment:
> 
> 			(These mount options override the errors behavior
> 			specified in the superblock, which can be configured
> 			using tune2fs)
> 
> Pavel, thanks for working on improving the documentation; with these
> fixes,

Thanks!

---

ext3 has quite unexpected semantics for "ro", and defaults are
not what they are documented to be, due to mkfs override.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index 9dd2a3b..49c08bf 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -14,6 +14,11 @@ Options
 When mounting an ext3 filesystem, the following option are accepted:
 (*) == default
 
+ro			Mount filesystem read only. Note that ext3 will replay
+			the journal (and thus write to the partition) even when
+			mounted "read only". Mount options "ro,noload" can be
+			used to prevent writes to the filesystem.
+
 journal=update		Update the ext3 file system's journal to the current
 			format.
 
@@ -27,7 +32,9 @@ journal_dev=devnum	When the external jou
 			identified through its new major/minor numbers encoded
 			in devnum.
 
-noload			Don't load the journal on mounting.
+noload			Don't load the journal on mounting. Note that this forces
+			mount of inconsistent filesystem, which can lead to
+			various problems.
 
 data=journal		All data are committed into the journal prior to being
 			written into the main file system.
@@ -92,9 +99,12 @@ nocheck
 
 debug			Extra debugging information is sent to syslog.
 
-errors=remount-ro(*)	Remount the filesystem read-only on an error.
+errors=remount-ro	Remount the filesystem read-only on an error.
 errors=continue		Keep going on a filesystem error.
 errors=panic		Panic and halt the machine if an error occurs.
+			(These mount options override the errors behavior
+			specified in the superblock, which can be 
+			configured using tune2fs.)			
 
 data_err=ignore(*)	Just print an error message if an error occurs
 			in a file data buffer in ordered mode.

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: [patch] document ext3 a bit better
  2009-01-06  9:21       ` Pavel Machek
@ 2009-01-09 23:24         ` Jiri Kosina
  2009-01-09 23:36           ` Randy Dunlap
  0 siblings, 1 reply; 67+ messages in thread
From: Jiri Kosina @ 2009-01-09 23:24 UTC (permalink / raw)
  To: Pavel Machek, Randy Dunlap, Jonathan Corbet
  Cc: Theodore Tso, kernel list, Andrew Morton, mtk.manpages, linux-doc,
	Trivial patch monkey

On Tue, 6 Jan 2009, Pavel Machek wrote:

> On Mon 2009-01-05 09:57:13, Theodore Tso wrote:
> > On Sun, Jan 04, 2009 at 11:34:33PM +0100, Pavel Machek wrote:
> > > @@ -14,6 +14,11 @@ Options
> > >  When mounting an ext3 filesystem, the following option are accepted:
> > >  (*) == default
> > >  
> > > +ro			Mount filesystem read only. Note that ext3 will replay
> > > +			the journal (and thus write to the partition) even when
> > > +			mounted "read only". "ro, noload" can be used to prevent
> > > +			writes to the filesystem.
> > 
> > > I'd suggest "ro,noload" since the spaces screw up the mount options
> > parsing both on the command-line and in /etc/fstab.  So how about:
> > 
> > 	Using the mount options "ro,noload" can be used....
> 
> Too many "using", but yes, fixed, thanks.
> 
> > > @@ -95,6 +102,8 @@ debug			Extra debugging information is s
> > >  errors=remount-ro(*)	Remount the filesystem read-only on an error.
> > >  errors=continue		Keep going on a filesystem error.
> > >  errors=panic		Panic and halt the machine if an error occurs.
> > > +			(Note that default is overriden by superblock
> > > +			setting on most systems).
> > 
> > The default is always specified by the superblock setting.  So users
> > > will probably find it easier to understand if we remove the "(*)" and
> > > add the explanatory comment:
> > 
> > 			(These mount options override the errors behavior
> > 			specified in the superblock, which can be configured
> > 			using tune2fs)
> > 
> > Pavel, thanks for working on improving the documentation; with these
> > fixes,
> 
> Thanks!
> 
> ---
> 
> ext3 has quite unexpected semantics for "ro", and defaults are
> not what they are documented to be, due to mkfs override.
> 
> Signed-off-by: Pavel Machek <pavel@suse.cz>
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> 
> diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
> index 9dd2a3b..49c08bf 100644
> --- a/Documentation/filesystems/ext3.txt
> +++ b/Documentation/filesystems/ext3.txt
> @@ -14,6 +14,11 @@ Options
>  When mounting an ext3 filesystem, the following option are accepted:
>  (*) == default
>  
> +ro			Mount filesystem read only. Note that ext3 will replay
> +			the journal (and thus write to the partition) even when
> +			mounted "read only". Mount options "ro,noload" can be
> +			used to prevent writes to the filesystem.
> +
>  journal=update		Update the ext3 file system's journal to the current
>  			format.
>  
> @@ -27,7 +32,9 @@ journal_dev=devnum	When the external jou
>  			identified through its new major/minor numbers encoded
>  			in devnum.
>  
> -noload			Don't load the journal on mounting.
> +noload			Don't load the journal on mounting. Note that this forces
> +			mount of inconsistent filesystem, which can lead to
> +			various problems.
>  
>  data=journal		All data are committed into the journal prior to being
>  			written into the main file system.
> @@ -92,9 +99,12 @@ nocheck
>  
>  debug			Extra debugging information is sent to syslog.
>  
> -errors=remount-ro(*)	Remount the filesystem read-only on an error.
> +errors=remount-ro	Remount the filesystem read-only on an error.
>  errors=continue		Keep going on a filesystem error.
>  errors=panic		Panic and halt the machine if an error occurs.
> +			(These mount options override the errors behavior
> +			specified in the superblock, which can be 
> +			configured using tune2fs.)			
>  
>  data_err=ignore(*)	Just print an error message if an error occurs
>  			in a file data buffer in ordered mode.
> 

So, documentation guys, are you going to take this patch through the 
Documentation tree (tytso already Signed off on that), or should I take it 
through trivial tree?

Thanks,

-- 
Jiri Kosina
SUSE Labs


* Re: [patch] document ext3 a bit better
  2009-01-09 23:24         ` Jiri Kosina
@ 2009-01-09 23:36           ` Randy Dunlap
  2009-01-09 23:47             ` Jiri Kosina
  0 siblings, 1 reply; 67+ messages in thread
From: Randy Dunlap @ 2009-01-09 23:36 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Pavel Machek, Jonathan Corbet, Theodore Tso, kernel list,
	Andrew Morton, mtk.manpages, linux-doc, Trivial patch monkey

Jiri Kosina wrote:
> On Tue, 6 Jan 2009, Pavel Machek wrote:
> 
>> On Mon 2009-01-05 09:57:13, Theodore Tso wrote:
>>> On Sun, Jan 04, 2009 at 11:34:33PM +0100, Pavel Machek wrote:
>>>> @@ -14,6 +14,11 @@ Options
>>>>  When mounting an ext3 filesystem, the following option are accepted:
>>>>  (*) == default
>>>>  
>>>> +ro			Mount filesystem read only. Note that ext3 will replay
>>>> +			the journal (and thus write to the partition) even when
>>>> +			mounted "read only". "ro, noload" can be used to prevent
>>>> +			writes to the filesystem.
>>> I'd suggest "ro,noload" since the spaces screw up the mount options
>>> parsing both on the command-line and in /etc/fstab.  So how about:
>>>
>>> 	Using the mount options "ro,noload" can be used....
>> Too many "using", but yes, fixed, thanks.
>>
>>>> @@ -95,6 +102,8 @@ debug			Extra debugging information is s
>>>>  errors=remount-ro(*)	Remount the filesystem read-only on an error.
>>>>  errors=continue		Keep going on a filesystem error.
>>>>  errors=panic		Panic and halt the machine if an error occurs.
>>>> +			(Note that default is overriden by superblock
>>>> +			setting on most systems).
>>> The default is always specified by the superblock setting.  So users
>>> will probably find it easier to understand if we remove the "(*)" and
>>> add the explanatory comment:
>>>
>>> 			(These mount options override the errors behavior
>>> 			specified in the superblock, which can be configured
>>> 			using tune2fs)
>>>
>>> Pavel, thanks for working on improving the documentation; with these
>>> fixes,
>> Thanks!
>>
>> ---
>>
>> ext3 has quite unexpected semantics for "ro", and defaults are
>> not what they are documented to be, due to mkfs override.
>>
>> Signed-off-by: Pavel Machek <pavel@suse.cz>
>> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
>>
>> diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
>> index 9dd2a3b..49c08bf 100644
>> --- a/Documentation/filesystems/ext3.txt
>> +++ b/Documentation/filesystems/ext3.txt
>> @@ -14,6 +14,11 @@ Options
>>  When mounting an ext3 filesystem, the following option are accepted:
>>  (*) == default
>>  
>> +ro			Mount filesystem read only. Note that ext3 will replay
>> +			the journal (and thus write to the partition) even when
>> +			mounted "read only". Mount options "ro,noload" can be
>> +			used to prevent writes to the filesystem.
>> +
>>  journal=update		Update the ext3 file system's journal to the current
>>  			format.
>>  
>> @@ -27,7 +32,9 @@ journal_dev=devnum	When the external jou
>>  			identified through its new major/minor numbers encoded
>>  			in devnum.
>>  
>> -noload			Don't load the journal on mounting.
>> +noload			Don't load the journal on mounting. Note that this forces
>> +			mount of inconsistent filesystem, which can lead to
>> +			various problems.
>>  
>>  data=journal		All data are committed into the journal prior to being
>>  			written into the main file system.
>> @@ -92,9 +99,12 @@ nocheck
>>  
>>  debug			Extra debugging information is sent to syslog.
>>  
>> -errors=remount-ro(*)	Remount the filesystem read-only on an error.
>> +errors=remount-ro	Remount the filesystem read-only on an error.
>>  errors=continue		Keep going on a filesystem error.
>>  errors=panic		Panic and halt the machine if an error occurs.
>> +			(These mount options override the errors behavior
>> +			specified in the superblock, which can be 
>> +			configured using tune2fs.)			
>>  
>>  data_err=ignore(*)	Just print an error message if an error occurs
>>  			in a file data buffer in ordered mode.
>>
> 
> So, documentation guys, are you going to take this patch through the 
> Documentation tree (tytso already Signed off on that), or should I take it 

                     (probably should be Acked-by or Reviewed-by
                      if he isn't merging it)

> through trivial tree?

I'm so far behind on doc patches that I haven't read any of this thread
yet, so you can merge it IMO.

Thanks,
-- 
~Randy


* Re: [patch] document ext3 a bit better
  2009-01-09 23:36           ` Randy Dunlap
@ 2009-01-09 23:47             ` Jiri Kosina
  0 siblings, 0 replies; 67+ messages in thread
From: Jiri Kosina @ 2009-01-09 23:47 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Pavel Machek, Jonathan Corbet, Theodore Tso, kernel list,
	Andrew Morton, mtk.manpages, linux-doc, Trivial patch monkey

On Fri, 9 Jan 2009, Randy Dunlap wrote:

> I'm so far behind on doc patches that I haven't read any of this thread 
> yet, so you can merge it IMO.

OK, I have applied it, thanks Pavel.

-- 
Jiri Kosina
SUSE Labs


* Re: document ext3 requirements
  2009-01-03 12:38 document ext3 requirements Pavel Machek
  2009-01-03 21:17 ` Martin MOKREJŠ
  2009-01-04  2:32 ` Theodore Tso
@ 2009-01-04 13:35 ` Alexander E. Patrakov
  2009-01-04 13:53   ` Valdis.Kletnieks
                     ` (3 more replies)
  2009-01-04 19:49 ` Rob Landley
  3 siblings, 4 replies; 67+ messages in thread
From: Alexander E. Patrakov @ 2009-01-04 13:35 UTC (permalink / raw)
  To: Pavel Machek
  Cc: kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap,
	linux-doc, Alan Cox

Pavel Machek wrote:
[CC: Alan Cox because of his reply in the "XFS internal error" thread]

> Using ext3 is only safe if storage subsystem meets certain
> criteria. Document those.

Thanks for this patch. However, after reading this, I have a stupid 
question: which file system should I use if I had to reinstall my 
computers from scratch now?

Ext3 means either hardware that supports barriers (not sure how to 
check, and anyway I have to use encryption on the work laptop due to the 
corporate policy) or disabling write cache (but, as Alan Cox said, this 
shortens the lifespan of the disk). Does this requirement apply to other 
journaling filesystems? Do I need journaling at all, given that I have 
an UPS on my desktop and a battery in the laptop?

-- 
Alexander E. Patrakov


* Re: document ext3 requirements
  2009-01-04 13:35 ` document ext3 requirements Alexander E. Patrakov
@ 2009-01-04 13:53   ` Valdis.Kletnieks
  2009-01-04 18:21   ` Michael Tokarev
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 67+ messages in thread
From: Valdis.Kletnieks @ 2009-01-04 13:53 UTC (permalink / raw)
  To: Alexander E. Patrakov
  Cc: Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc, Alan Cox


On Sun, 04 Jan 2009 18:35:41 +0500, "Alexander E. Patrakov" said:

> Ext3 means either hardware that supports barriers (not sure how to 
> check, and anyway I have to use encryption on the work laptop due to the 
> corporate policy) or disabling write cache (but, as Alan Cox said, this 
> shortens the lifespan of the disk).

False dichotomy.  This isn't an "either/or", as there's a *third* case:

"understand the issues and risks involved if you have a write cache and
no barrier support, and learn to deal with it".

As you point out, if it's a laptop with a battery, the risk may be *very* low.
Let's say there's a 1 in 10,000 chance that you'll trash a file system and
need to restore from backups.

That may be totally acceptable if you've already estimated a 1 in 500 chance
of the whole damned laptop going walkies while you're not looking, and then
you *still* need to be able to restore from backups onto a replacement machine.

Yes, for some systems, the whole barriers/write cache thing is in fact very
important.  But for others, data loss due to spilled coffee is a bigger worry...



* Re: document ext3 requirements
  2009-01-04 13:35 ` document ext3 requirements Alexander E. Patrakov
  2009-01-04 13:53   ` Valdis.Kletnieks
@ 2009-01-04 18:21   ` Michael Tokarev
  2009-01-04 18:38   ` Theodore Tso
  2009-01-04 20:10   ` Pavel Machek
  3 siblings, 0 replies; 67+ messages in thread
From: Michael Tokarev @ 2009-01-04 18:21 UTC (permalink / raw)
  To: Alexander E. Patrakov
  Cc: Pavel Machek, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc, Alan Cox

Alexander E. Patrakov wrote:
[]
> Ext3 means either hardware that supports barriers (not sure how to
> check, and anyway I have to use encryption on the work laptop due to the
> corporate policy) or disabling write cache (but, as Alan Cox said, this
> shortens the lifespan of the disk). Does this requirement apply to other
> journaling filesystems? Do I need journaling at all, given that I have
> an UPS on my desktop and a battery in the laptop?

There's another possibility too, somewhat more risky.  Namely, run with
write cache ON by default, and switch it off when running off battery
(either UPS or notebook).  That should give the best of both worlds,
PROVIDED the battery/UPS actually works :)
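
A minimal sketch of that policy, assuming the drive's write cache is toggled
with `hdparm -W` and that the caller already knows whether the machine is on
mains power (how to detect that varies by system and is not shown). The
helper name and /dev/sdX are placeholders, and the command is printed rather
than executed.

```shell
#!/bin/sh
# Sketch only: write_cache_cmd is a made-up helper; /dev/sdX is a placeholder.

# $1 = 1 when on mains power, 0 when on battery/UPS; $2 = disk device.
# Prints the hdparm command: cache on (-W1) on mains, off (-W0) on battery.
write_cache_cmd() {
    case "$1" in
        1) printf 'hdparm -W1 %s\n' "$2" ;;
        0) printf 'hdparm -W0 %s\n' "$2" ;;
        *) return 1 ;;
    esac
}

write_cache_cmd 0 /dev/sdX
```

Such a helper would typically be called from whatever power-management hook
the system runs on AC/battery transitions.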

/mjt




* Re: document ext3 requirements
  2009-01-04 13:35 ` document ext3 requirements Alexander E. Patrakov
  2009-01-04 13:53   ` Valdis.Kletnieks
  2009-01-04 18:21   ` Michael Tokarev
@ 2009-01-04 18:38   ` Theodore Tso
  2009-01-04 22:37     ` Pavel Machek
  2009-01-05 11:43     ` Alan Cox
  2009-01-04 20:10   ` Pavel Machek
  3 siblings, 2 replies; 67+ messages in thread
From: Theodore Tso @ 2009-01-04 18:38 UTC (permalink / raw)
  To: Alexander E. Patrakov
  Cc: Pavel Machek, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, Alan Cox

On Sun, Jan 04, 2009 at 06:35:41PM +0500, Alexander E. Patrakov wrote:
>
> Ext3 means either hardware that supports barriers (not sure how to  
> check

Pretty much all modern disk drives support barriers.  And note that
w/o barriers ext3 has worked pretty well.  *If* you have a workload that
pushes your system into a mode where it is very low on memory,
so it is constantly paging/thrashing, and that workload is
metadata intensive, and you crash the machine while it is thrashing,
it is possible to end up in a situation where your filesystem is
corrupted and you have to use e2fsck to correct the filesystem.  In
practice this is often not the case, which is why the default for ext3
has been with barriers disabled, and most people have not noted major
problems.  This is why Andrew Morton has refused to accept the patch for
ext3 which enables barriers by default; he's not convinced the
performance hit is worth the improvement in reliability.

Ext4 does enable barriers by default, mainly because filesystem
developers tend to believe that reliability is more important than
performance.  (On the other hand, Google runs with ext2 w/o
journalling, because everything is replicated three times and it's
easier to just blow away the filesystem and resync from one of the
duplicate copies; so in the right circumstances, maybe worrying only
about performance and ignoring reliability makes perfect sense.)

> and anyway I have to use encryption on the work laptop due to the  
> corporate policy

If dm supported barriers, this wouldn't be an issue.  Personally, I
find the convenience of LVM is so useful that I use ext4 with LVM,
even though the barrier requests get dropped on the ground.  And I'm a
kernel developer, and I use a laptop with suspend/resume, which means
I often crash uncleanly --- and I've not lost data yet, despite the
lack of barriers.  (On the other hand, my laptop has 4 gigs of memory,
so I'm rarely thrashing due to memory pressure.)
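
A minimal sketch of turning barriers on for ext3 and checking whether a
lower layer dropped them. The `barrier=1` mount option is from the ext3
documentation; the kernel log text matched below is an assumption about the
wording of the JBD warning, and the device name and helper name are
placeholders.

```shell
#!/bin/sh
# Sketch only: /dev/sdXN is a placeholder and barriers_dropped is a made-up
# helper. The matched text is an assumption about the JBD warning wording.

# Read kernel log text on stdin; succeed if JBD reported disabling barriers.
barriers_dropped() {
    grep -q 'barrier-based sync failed'
}

# Typical use (needs root and a real device, so left commented out):
#   mount -t ext3 -o barrier=1 /dev/sdXN /mnt
#   dmesg | barriers_dropped && echo "barriers are not reaching the disk"
```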

> or disabling write cache (but, as Alan Cox said, this  
> shortens the lifespan of the disk).

Huh?  I've never heard an assertion that disabling the write cache (I
assume you mean using write-through caching as opposed to write-back
caching), shortens the lifespan of disk drives.  Aggressive battery
saving mode is far more likely to shorten disk drive life, due to
spinning the platters up and down a lot.   

> Does this requirement apply to other  
> journaling filesystems? Do I need journaling at all, given that I have  
> an UPS on my desktop and a battery in the laptop?

Which requirement?  Barriers?  Most journaling filesystems simply
enable barriers by default.  

And journalling is useful so that if your system crashes, say due to
suspend and resume not working out, or the battery runs dry without
your noticing it, you can avoid running fsck at boot time.  It's
really about shortening the boot time after a crash more than
anything else.

					- Ted


* Re: document ext3 requirements
  2009-01-04 18:38   ` Theodore Tso
@ 2009-01-04 22:37     ` Pavel Machek
  2009-01-04 23:58       ` Theodore Tso
  2009-01-05 11:43     ` Alan Cox
  1 sibling, 1 reply; 67+ messages in thread
From: Pavel Machek @ 2009-01-04 22:37 UTC (permalink / raw)
  To: Theodore Tso, Alexander E. Patrakov, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, Alan Cox

On Sun 2009-01-04 13:38:34, Theodore Tso wrote:
> On Sun, Jan 04, 2009 at 06:35:41PM +0500, Alexander E. Patrakov wrote:
> >
> > Ext3 means either hardware that supports barriers (not sure how to  
> > check
> 
> Pretty much all modern disk drives support barriers.  And note that
> w/o barriers ext3 has worked pretty well.  *If* you have a workload that
> pushes your system into a mode where it is very low on memory,
> so it is constantly paging/thrashing, and that workload is
> metadata intensive, and you crash the machine while it is thrashing,
> it is possible to end up in a situation where your filesystem is
> corrupted and you have to use e2fsck to correct the filesystem.  In

Are you sure you need to have thrashing? AFAICT metadata + fsync heavy
workload should be enough... and there were scripts to easily repeat
that.

> > Does this requirement apply to other  
> > journaling filesystems? Do I need journaling at all, given that I have  
> > an UPS on my desktop and a battery in the laptop?
> 
> Which requirement?  Barriers?  Most journaling filesystems simply
> enable barriers by default.  
> 
> And journalling is useful so that if your system crashes, say due to
> suspend and resume not working out, or the battery runs dry without
> your noticing it, you can avoid running fsck at boot time.  It's
> really about shortening the boot time after a crash more than
> anything else.

Actually, journalling with barrier=0 should still be "safe" in case of
kernel crashes (*), right? Because if just the kernel is dead, the disk
firmware will still write the cache back, AFAICT.
									Pavel

(*) kernel crashes that do not involve writing random garbage to disk.
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: document ext3 requirements
  2009-01-04 22:37     ` Pavel Machek
@ 2009-01-04 23:58       ` Theodore Tso
  0 siblings, 0 replies; 67+ messages in thread
From: Theodore Tso @ 2009-01-04 23:58 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Alexander E. Patrakov, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, Alan Cox

On Sun, Jan 04, 2009 at 11:37:56PM +0100, Pavel Machek wrote:
> 
> Are you sure you need to have thrashing? AFAICT metadata + fsync heavy
> workload should be enough... and there were scripts to easily repeat
> that.

The memory pressure is needed to force disk buffers out to disk sooner
than fsync() would normally force buffers out.  The scripts which I've
seen induced memory pressure.  If the disk is *super* aggressive at
reordering writes, I suppose a heavy fsync workload might be enough on
its own, but in practice, it's generally not enough.

							- Ted


* Re: document ext3 requirements
  2009-01-04 18:38   ` Theodore Tso
  2009-01-04 22:37     ` Pavel Machek
@ 2009-01-05 11:43     ` Alan Cox
  2009-01-07 11:59       ` Rob Landley
  1 sibling, 1 reply; 67+ messages in thread
From: Alan Cox @ 2009-01-05 11:43 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Alexander E. Patrakov, Pavel Machek, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc

> If dm supported barriers, this wouldn't be an issue.  Personally, I

"If the dm people applied the patches to support barriers" I believe is
the correct description - Andi ? 

dm and md want fixing and even in the md case it isn't hard to do right.

> > or disabling write cache (but, as Alan Cox said, this  
> > shortens the lifespan of the disk).
> 
> Huh?  I've never heard an assertion that disabling the write cache (I
> assume you mean using write-through caching as opposed to write-back
> caching), shortens the lifespan of disk drives.  Aggressive battery

That's what I was told by a disk vendor - simply because the drive makes a
lot more mechanical movements and writes.

> your noticing it, you can avoid running fsck at boot time.  It's
> really about shortening the boot time after a crash more than
> anything else.

That depends enormously on your environment. In a secure environment full
data journalling is practically essential to avoid the tiny risk of bits
of important data turning up in another user's file.

Alan


* Re: document ext3 requirements
  2009-01-05 11:43     ` Alan Cox
@ 2009-01-07 11:59       ` Rob Landley
  0 siblings, 0 replies; 67+ messages in thread
From: Rob Landley @ 2009-01-07 11:59 UTC (permalink / raw)
  To: Alan Cox
  Cc: Theodore Tso, Alexander E. Patrakov, Pavel Machek, kernel list,
	Andrew Morton, mtk.manpages, rdunlap, linux-doc

On Monday 05 January 2009 05:43:29 Alan Cox wrote:
> > Huh?  I've never heard an assertion that disabling the write cache (I
> > assume you mean using write-through caching as opposed to write-back
> > caching), shortens the lifespan of disk drives.  Aggressive battery
>
> That's what I was told by a disk vendor - simply because the drive makes a
> lot more mechanical movements and writes.

It certainly sounds like less write caching would shorten the lifespan of
flash devices...

Rob


* Re: document ext3 requirements
  2009-01-04 13:35 ` document ext3 requirements Alexander E. Patrakov
                     ` (2 preceding siblings ...)
  2009-01-04 18:38   ` Theodore Tso
@ 2009-01-04 20:10   ` Pavel Machek
  3 siblings, 0 replies; 67+ messages in thread
From: Pavel Machek @ 2009-01-04 20:10 UTC (permalink / raw)
  To: Alexander E. Patrakov
  Cc: kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap,
	linux-doc, Alan Cox

On Sun 2009-01-04 18:35:41, Alexander E. Patrakov wrote:
> Pavel Machek wrote:
> [CC: Alan Cox because of his reply in the "XFS internal error" thread]
>
>> Using ext3 is only safe if storage subsystem meets certain
>> criteria. Document those.
>
> Thanks for this patch. However, after reading this, I have a stupid  
> question: which file system should I use if I had to reinstall my  
> computers from scratch now?

ext2 is still the safest default... if you can live with fsck.

ext3 is the safest of the journalling ones, AFAICT.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: document ext3 requirements
  2009-01-03 12:38 document ext3 requirements Pavel Machek
                   ` (2 preceding siblings ...)
  2009-01-04 13:35 ` document ext3 requirements Alexander E. Patrakov
@ 2009-01-04 19:49 ` Rob Landley
  2009-01-04 22:06   ` Theodore Tso
  2009-01-04 22:55   ` Pavel Machek
  3 siblings, 2 replies; 67+ messages in thread
From: Rob Landley @ 2009-01-04 19:49 UTC (permalink / raw)
  To: Pavel Machek
  Cc: kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap,
	linux-doc

On Saturday 03 January 2009 06:38:15 Pavel Machek wrote:
> +Ext3 expects disk/storage subsystem to behave sanely. On sanely
> +behaving disk subsystem, data that have been successfully synced will
> +stay on the disk. Sane means:
> +
> +* writes to media never fail. Even if disk returns error condition during
> +  write, ext3 can't handle that correctly, because success on fsync was
> +  already returned when data hit the journal.
> +
> +	   (Fortunately writes failing are very uncommon on disks, as they
> +	   have spare sectors they use when write fails.)
> +
> +* either whole sector is correctly written or nothing is written during
> +  powerfail.
> +
> +	   (Unfortuantely, none of the cheap USB/SD flash cards I seen do behave
> +	   like this, and are unsuitable for ext3.

Want to document the granularity issues with flash, while you're at it?

An inherent problem with using flash as a normal block device is that the 
flash erase size is bigger than most filesystem sector sizes.  So when you 
request a write, it may erase and rewrite the next 64k, 128k, or even a couple 
megabytes on the really _big_ ones.

If you lose power in the middle of that, ext3 won't notice that data in the
"sectors" _after_ the one you were trying to write to got trashed.

The flash filesystems take this into account as part of their wear levelling 
stuff (they normally copy the entire chunk into a new chunk, leaving the old 
one in place until it's no longer needed), but they need to query the device 
to get the erase granularity in order to do that, which is why they don't work 
on non-flash block devices.
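
A toy sketch of the "copy the entire chunk into a new chunk" idea, in Python. This is only an illustration under my own assumptions: the class `ToyFlash`, its fields, and the `fail_mid_copy` switch are made up for the example and do not correspond to jffs2, UBIFS, or any real flash filesystem code.

```python
class ToyFlash:
    """Toy model of out-of-place chunk updates (not a real flash filesystem)."""

    def __init__(self, num_chunks, chunk_size):
        self.chunks = [bytes(chunk_size) for _ in range(num_chunks)]
        self.free = list(range(1, num_chunks))  # chunks not holding live data
        self.live = 0                           # chunk currently holding the data

    def update(self, offset, data, fail_mid_copy=False):
        """Write `data` at `offset` by building a modified copy in a free chunk."""
        new = self.free.pop(0)
        buf = bytearray(self.chunks[self.live])
        buf[offset:offset + len(data)] = data
        if fail_mid_copy:
            # Simulated power loss: the new chunk is only half written and the
            # live pointer was never switched, so the old chunk is untouched.
            # (A real implementation would reclaim the half-written chunk later.)
            half = len(buf) // 2
            self.chunks[new] = bytes(buf[:half]) + b"\xff" * (len(buf) - half)
            return
        self.chunks[new] = bytes(buf)
        old, self.live = self.live, new   # switch over only after the full copy
        self.free.append(old)             # old chunk is no longer needed

    def read(self):
        return self.chunks[self.live]
```

The point of the design is that the old chunk stays valid until the new one is complete, so an interrupted update loses at most the update itself, not its neighbours.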

Rob

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: document ext3 requirements
  2009-01-04 19:49 ` Rob Landley
@ 2009-01-04 22:06   ` Theodore Tso
  2009-01-04 22:25     ` Pavel Machek
                       ` (3 more replies)
  2009-01-04 22:55   ` Pavel Machek
  1 sibling, 4 replies; 67+ messages in thread
From: Theodore Tso @ 2009-01-04 22:06 UTC (permalink / raw)
  To: Rob Landley
  Cc: Pavel Machek, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc

On Sun, Jan 04, 2009 at 01:49:49PM -0600, Rob Landley wrote:
> 
> Want to document the granularity issues with flash, while you're at it?
> 
> An inherent problem with using flash as a normal block device is that the 
> flash erase size is bigger than most filesystem sector sizes.  So when you 
> request a write, it may erase and rewrite the next 64k, 128k, or even a couple 
> megabytes on the really _big_ ones.
> 
> If you lose power in the middle of that, ext3 won't notice that data in the 
> "sectors" _after_ the one you were trying to write to got trashed.

True enough, although the newer SSD's will have this problem addressed
(although at least initially, they are **far** more costly than the
el-cheapo 32GB SD cards you can find at the checkout counter at Fry's
alongside battery-powered shavers and trashy ipod speakers).

I will stress again, that most of this doesn't belong in
Documentation/filesystems/ext3.txt, as most of this is *not*
ext3-specific.

						- Ted

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: document ext3 requirements
  2009-01-04 22:06   ` Theodore Tso
@ 2009-01-04 22:25     ` Pavel Machek
  2009-01-04 23:00     ` [patch] " Pavel Machek
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 67+ messages in thread
From: Pavel Machek @ 2009-01-04 22:25 UTC (permalink / raw)
  To: Theodore Tso, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc

On Sun 2009-01-04 17:06:34, Theodore Tso wrote:
> On Sun, Jan 04, 2009 at 01:49:49PM -0600, Rob Landley wrote:
> > 
> > Want to document the granularity issues with flash, while you're at it?
> > 
> > An inherent problem with using flash as a normal block device is that the 
> > flash erase size is bigger than most filesystem sector sizes.  So when you 
> > request a write, it may erase and rewrite the next 64k, 128k, or even a couple 
> > megabytes on the really _big_ ones.
> > 
> > If you lose power in the middle of that, ext3 won't notice that data in the 
> > "sectors" _after_ the one you were trying to write to got trashed.
> 
> True enough, although the newer SSD's will have this problem addressed
> (although at least initially, they are **far** more costly than the
> el-cheapo 32GB SD cards you can find at the checkout counter at Fry's
> alongside battery-powered shavers and trashy ipod speakers).
> 
> I will stress again, that most of this doesn't belong in
> Documentation/filesystems/ext3.txt, as most of this is *not*
> ext3-specific.

I initially did the patch for ext3 because that's what I'm using
and because I felt responsible for documenting it after a huge thread.

At least barrier=1 seems to be ext3 specific, and perhaps logfs or
something can survive full eraseblocks disappearing. Anyway, I guess
we all agree that this needs to be documented _somewhere_, and that's
what I'm trying to do.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 67+ messages in thread

* [patch] Re: document ext3 requirements
  2009-01-04 22:06   ` Theodore Tso
  2009-01-04 22:25     ` Pavel Machek
@ 2009-01-04 23:00     ` Pavel Machek
  2009-01-05  2:42       ` Rob Landley
  2009-01-04 23:07     ` Pavel Machek
  2009-01-05  1:38     ` Rob Landley
  3 siblings, 1 reply; 67+ messages in thread
From: Pavel Machek @ 2009-01-04 23:00 UTC (permalink / raw)
  To: Theodore Tso, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc

On Sun 2009-01-04 17:06:34, Theodore Tso wrote:
> On Sun, Jan 04, 2009 at 01:49:49PM -0600, Rob Landley wrote:
> > 
> > Want to document the granularity issues with flash, while you're at it?
> > 
> > An inherent problem with using flash as a normal block device is that the 
> > flash erase size is bigger than most filesystem sector sizes.  So when you 
> > request a write, it may erase and rewrite the next 64k, 128k, or even a couple 
> > megabytes on the really _big_ ones.
> > 
> > If you lose power in the middle of that, ext3 won't notice that data in the 
> > "sectors" _after_ the one you were trying to write to got trashed.
> 
> True enough, although the newer SSD's will have this problem addressed
> (although at least initially, they are **far** more costly than the
> el-cheapo 32GB SD cards you can find at the checkout counter at Fry's
> alongside battery-powered shavers and trashy ipod speakers).
> 
> I will stress again, that most of this doesn't belong in
> Documentation/filesystems/ext3.txt, as most of this is *not*
> ext3-specific.

Agreed... So what about this one?

---

Document linux filesystem expectations. Ext3 can't handle write errors
of any kind, and can't handle non-atomic sector writes. Other
filesystems are probably even worse...

Signed-off-by: Pavel Machek <pavel@suse.cz>

diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
new file mode 100644
index 0000000..7817a9c
--- /dev/null
+++ b/Documentation/filesystems/expectations.txt
@@ -0,0 +1,44 @@
+Linux filesystems can only work correctly when several conditions are
+met in the block layer and below (disks, flash cards). Some of them
+are obvious ("data on media should not change randomly"), some are
+less so.
+
+Write errors not allowed (NO-WRITE-ERRORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Writes to media never fail. Even if disk returns error condition
+during write, filesystems can't handle that correctly, because success
+on fsync was already returned when data hit the journal.
+
+	Fortunately, failed writes are very uncommon on traditional
+	spinning disks, as they have spare sectors they use when a write
+	fails.
+
+Sector writes are atomic (ATOMIC-SECTORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Either whole sector is correctly written or nothing is written during
+powerfail.
+
+	Unfortunately, none of the cheap USB/SD flash cards I have seen
+	behave like this, so they are unsuitable for all Linux filesystems
+	I know.
+
+		An inherent problem with using flash as a normal block
+		device is that the flash erase size is bigger than
+		most filesystem sector sizes.  So when you request a
+		write, it may erase and rewrite the next 64k, 128k, or
+		even a couple megabytes on the really _big_ ones.
+
+		If you lose power in the middle of that, the filesystem
+		won't notice that data in the "sectors" _after_ the
+		one you were trying to write to got trashed.
+
+	Because RAM tends to fail faster than the rest of the system during
+	powerfail, special hw killing DMA transfers may be necessary;
+	otherwise, disks may write garbage during powerfail.
+	Not sure how common that problem is on generic PC machines.
+
+
+
+
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index 9dd2a3b..8cb64b0 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -188,6 +197,25 @@ mke2fs: 	create a ext3 partition with th
 debugfs: 	ext2 and ext3 file system debugger.
 ext2online:	online (mounted) ext2 and ext3 filesystem resizer
 
+Requirements
+============
+
+Ext3 expects disk/storage subsystem to behave sanely. On sanely
+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:
+
+* write errors not allowed
+
+* sector writes are atomic
+
+(see expectations.txt; note that most/all linux filesystems have similar
+expectations)
+
+* either write caching is disabled, or hw can do barriers and they are enabled.
+
+	   (Note that barriers are disabled by default, use "barrier=1"
+	   mount option after making sure hw can support them). 
+
 
 References
 ==========


-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [patch] Re: document ext3 requirements
  2009-01-04 23:00     ` [patch] " Pavel Machek
@ 2009-01-05  2:42       ` Rob Landley
  2009-01-05  9:54         ` Pavel Machek
  0 siblings, 1 reply; 67+ messages in thread
From: Rob Landley @ 2009-01-05  2:42 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Theodore Tso, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc

On Sunday 04 January 2009 17:00:53 Pavel Machek wrote:
> Document linux filesystem expectations. Ext3 can't handle write errors
> of any kind, and can't handle non-atomic sector writes. Other
> filesystems are probably even worse...

These concerns look like they're specifically for block backed filesystems, 
which is one of four different types.  I wrote a longish incoherent rant to 
the busybox list about the different types of filesystems a couple months 
back, in the context of a thread about implementing the "mount" command.  
Dunno how relevant it is:
http://lists.busybox.net/pipermail/busybox/2008-November/067970.html

There are a couple fun relevant corner cases, such as the fact that nfs is the 
only filesystem I'm aware of where the return value of close() can actually 
mean something.  (Due to the caching, you tend to get errors reported 
_there_.  I don't remember why, if I ever knew.)

> Signed-off-by: Pavel Machek <pavel@suse.cz>
>
> diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
> new file mode 100644
> index 0000000..7817a9c
> --- /dev/null
> +++ b/Documentation/filesystems/expectations.txt
> @@ -0,0 +1,44 @@
> +Linux filesystems can only work correctly when several conditions are
> +met in the block layer and below (disks, flash cards). Some of them
> +are obvious ("data on media should not change randomly"), some are
> +less so.
> +
> +Write errors not allowed (NO-WRITE-ERRORS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Writes to media never fail. Even if disk returns error condition
> +during write, filesystems can't handle that correctly, because success
> +on fsync was already returned when data hit the journal.
> +
> +	Fortunately writes failing are very uncommon on traditional
> +	spinning disks, as they have spare sectors they use when write
> +	fails.

The failures show up in dmesg, and some filesystems will remount themselves 
read only if the physical media driver manages to propagate an error back to 
the filesystem.  (Note that the scsi subsystem has historically had so many 
glue layers that it couldn't manage to do this; that's been improved over the 
years but whether or not it actually _works_ now, I couldn't tell you.)

Some kind of system monitor could notice the dmesg entries, but the actual 
write goes into the cache and the physical media error normally happens long 
after the program that did the write returned from its write call, often after 
it closed the file, and sometimes after it exited.

Even sync() and fsync() won't help you there because if multiple processes do 
that, only the _first_ one will get the physical media error.  (The filesystem 
doesn't associate physical media errors with processes; there's too many 
layers in between and it's not necessarily a 1:1 relationship anyway.)
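
For what it's worth, here is a minimal Python sketch of what an application can do on its side: check every step (write, fsync, close) and treat any failure as "data not safe". The function name `write_durably` is made up for the example. As described above, this still cannot catch a media error that happens after fsync() returned, or one that some other process's fsync() already consumed.

```python
import os

def write_durably(path, data):
    """Write `data` to `path`, raising OSError if any step reports a failure."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write() may write fewer bytes than requested; loop until done.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to push the data to the device; an error here is
        # raised as OSError rather than being silently dropped.
        os.fsync(fd)
    finally:
        # close() can also report an error (notably on nfs); os.close()
        # raises OSError in that case.
        os.close(fd)
```

A caller would wrap `write_durably("/some/file", b"...")` in `try`/`except OSError` and keep its own copy of the data until the call returns cleanly.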

> +Sector writes are atomic (ATOMIC-SECTORS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Either whole sector is correctly written or nothing is written during
> +powerfail.
> +
> +	Unfortunately, none of the cheap USB/SD flash cards I have seen
> +	behave like this, so they are unsuitable for all Linux filesystems
> +	I know.

My impression is you might as well leave the suckers vfat.  It's a stupid 
little filesystem but its very stupidity makes it as resistant to damage as 
anything else (which admittedly isn't much), and it's had such a history of 
_taking_ damage that the tools to cope with damage to it are actually pretty 
good.

That said, constant updates to the first few sectors will burn out your USB 
flash disk if you use it as something other than a backup media.  That's true 
with a lot of filesystems.  (Hardware wear levelling isn't very good, it 
cycles between the same dozen or so physical sectors for each logical sector.)

In general, those game consoles that say "please don't power off the thing 
while we're writing to flash" have a reason for the message. :)

> +		An inherent problem with using flash as a normal block
> +		device is that the flash erase size is bigger than
> +		most filesystem sector sizes.  So when you request a
> +		write, it may erase and rewrite the next 64k, 128k, or
> +		even a couple megabytes on the really _big_ ones.
> +
> +		If you lose power in the middle of that, filesystem
> +		won't notice that data in the "sectors" _after_ the
> +		one you were trying to write to got trashed.
> +
> +	Because RAM tends to fail faster than rest of system during
> +	powerfail, special hw killing DMA transfers may be necessary;
> +	otherwise, disks may write garbage during powerfail.
> +	Not sure how common that problem is on generic PC machines.
> +
> +
> +
> +
> diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
> index 9dd2a3b..8cb64b0 100644
> --- a/Documentation/filesystems/ext3.txt
> +++ b/Documentation/filesystems/ext3.txt
> @@ -188,6 +197,25 @@ mke2fs: 	create a ext3 partition with th
>  debugfs: 	ext2 and ext3 file system debugger.
>  ext2online:	online (mounted) ext2 and ext3 filesystem resizer
>
> +Requirements
> +============
> +
> +Ext3 expects disk/storage subsystem to behave sanely. On sanely
> +behaving disk subsystem, data that have been successfully synced will
> +stay on the disk. Sane means:
> +
> +* write errors not allowed
> +
> +* sector writes are atomic
> +
> +(see expectations.txt; note that most/all linux filesystems have similar
> +expectations)

nfs, cifs, procfs, sysfs, usbfs, tmpfs, ramfs, fuse...

> +* either write caching is disabled, or hw can do barriers and they are enabled.
> +
> +	   (Note that barriers are disabled by default, use "barrier=1"
> +	   mount option after making sure hw can support them).

So how does one make sure hw can support them?

Rob

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [patch] Re: document ext3 requirements
  2009-01-05  2:42       ` Rob Landley
@ 2009-01-05  9:54         ` Pavel Machek
  0 siblings, 0 replies; 67+ messages in thread
From: Pavel Machek @ 2009-01-05  9:54 UTC (permalink / raw)
  To: Rob Landley
  Cc: Theodore Tso, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc

> On Sunday 04 January 2009 17:00:53 Pavel Machek wrote:
> > Document linux filesystem expectations. Ext3 can't handle write errors
> > of any kind, and can't handle non-atomic sector writes. Other
> > filesystems are probably even worse...
> 
> These concerns look like they're specifically for block backed filesystems, 
> which is one of four different types.  I wrote a longish incoherent
> rant to 

I updated the docs. It now states "block-backed filesystems" in the
first sentence.

> > +Write errors not allowed (NO-WRITE-ERRORS)
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +Writes to media never fail. Even if disk returns error condition
> > +during write, filesystems can't handle that correctly, because success
> > +on fsync was already returned when data hit the journal.
> > +
> > +	Fortunately writes failing are very uncommon on traditional
> > +	spinning disks, as they have spare sectors they use when write
> > +	fails.
> 
> The failures show up in dmesg(), and some filesystems will remount themselves 
> read only if the physical media driver manages to propogate an error back to 
> to the filesystem.  (Note that the scsi subsystem has historically

Well, you may get an error in dmesg(), but your data are already gone
at that point (and  apps don't read dmesg, anyway :-).

> Even sync() and fsync() won't help you there because if multiple processes do 
> that, only the _first_ one will get the physical media error.  (The filesystem 
> doesn't associate physical media errors with processes; there's too many 
> layers in between and it's not necessarily a 1:1 relationship
> anyway.)

sync() does not even have return value.

Yep. I'm trying to get fsync manpage updated.

> > +* either write caching is disabled, or hw can do barriers and they are enabled.
> > +
> > +	   (Note that barriers are disabled by default, use "barrier=1"
> > +	   mount option after making sure hw can support them).
> 
> So how does one make sure hw can support them?

hdparm -I reports them. If you don't see "Native Command Queueing",
you have a problem.
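
A rough Python sketch of that check, for scripting it. The helper names `identify_text` and `mentions_feature` are made up, and the check is nothing more than "does the phrase appear in the `hdparm -I` output"; whether NCQ is really the right indicator for barrier support is an assumption taken from the text above, not something the sketch verifies.

```python
import subprocess

def identify_text(device):
    """Return the text printed by `hdparm -I DEVICE` (needs root and hdparm)."""
    result = subprocess.run(["hdparm", "-I", device],
                            capture_output=True, text=True, check=True)
    return result.stdout

def mentions_feature(text, feature="Native Command Queueing"):
    """True if any line of the identify text mentions the feature name."""
    return any(feature.lower() in line.lower() for line in text.splitlines())
```

Typical use would be `mentions_feature(identify_text("/dev/sda"))`, where `/dev/sda` is only a placeholder device name.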

Interestingly, neither the x60 notebook nor a pretty recent AMD workstation
has NCQ... The AMD notebook seems to be ok.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: document ext3 requirements
  2009-01-04 22:06   ` Theodore Tso
  2009-01-04 22:25     ` Pavel Machek
  2009-01-04 23:00     ` [patch] " Pavel Machek
@ 2009-01-04 23:07     ` Pavel Machek
  2009-01-05  1:38     ` Rob Landley
  3 siblings, 0 replies; 67+ messages in thread
From: Pavel Machek @ 2009-01-04 23:07 UTC (permalink / raw)
  To: Theodore Tso, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc

On Sun 2009-01-04 17:06:34, Theodore Tso wrote:
> On Sun, Jan 04, 2009 at 01:49:49PM -0600, Rob Landley wrote:
> > 
> > Want to document the granularity issues with flash, while you're at it?
> > 
> > An inherent problem with using flash as a normal block device is that the 
> > flash erase size is bigger than most filesystem sector sizes.  So when you 
> > request a write, it may erase and rewrite the next 64k, 128k, or even a couple 
> > megabytes on the really _big_ ones.
> > 
> > If you lose power in the middle of that, ext3 won't notice that data in the 
> > "sectors" _after_ the one you were trying to write to got trashed.
> 
> True enough, although the newer SSD's will have this problem addressed
> (although at least initially, they are **far** more costly than the
> el-cheapo 32GB SD cards you can find at the checkout counter at Fry's
> alongside battery-powered shavers and trashy ipod speakers).

Hey, I got one of those el-cheapo 32GB SD cards. I fully expected it
to be slow, but eating my data 3 times per month was unexpected even
for me.

I'm not even sure where the blame is. I certainly blame the Linux
documentation: there should be "DON'T USE CRAPPY SD CARDS" warning in
big bold letters somewhere. I guess mkfs.ext3 should just refuse to
make filesystem on them. (Of course, the manufacturer should have told
me that the card is crap; I can bet it can not even work with
VFAT/Windows).

Plus I'd hope some filesystem materializes that can handle 128KB
"block size"... because the el-cheapo card I have here is actually
pretty sane. It seems to store data I put on it, and should be safe to
use with huge block size...  

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: document ext3 requirements
  2009-01-04 22:06   ` Theodore Tso
                       ` (2 preceding siblings ...)
  2009-01-04 23:07     ` Pavel Machek
@ 2009-01-05  1:38     ` Rob Landley
  3 siblings, 0 replies; 67+ messages in thread
From: Rob Landley @ 2009-01-05  1:38 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Pavel Machek, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc

On Sunday 04 January 2009 16:06:34 Theodore Tso wrote:
> True enough, although the newer SSD's will have this problem addressed
> (although at least initially, they are **far** more costly than the
> el-cheapo 32GB SD cards you can find at the checkout counter at Fry's
> alongside battery-powered shavers and trashy ipod speakers).

I have great faith in the ability of PC hardware to continue to be crap for 
the foreseeable future.

> I will stress again, that most of this doesn't belong in
> Documentation/filesystems/ext3.txt, as most of this is *not*
> ext3-specific.

Yes and no.  Ext3 is enough of a "default" filesystem for Linux that some 
documentation on when _not_ to use it sounds like a good idea.

That said, some kind of a "choosing a filesystem" file would be good, perhaps 
under the filesystems directory.  (Then the ext3 doc would just need a brief 
comment and a pointer to the other file.)

Rob

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: document ext3 requirements
  2009-01-04 19:49 ` Rob Landley
  2009-01-04 22:06   ` Theodore Tso
@ 2009-01-04 22:55   ` Pavel Machek
  2009-01-05  0:16     ` david
                       ` (2 more replies)
  1 sibling, 3 replies; 67+ messages in thread
From: Pavel Machek @ 2009-01-04 22:55 UTC (permalink / raw)
  To: Rob Landley
  Cc: kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap,
	linux-doc

On Sun 2009-01-04 13:49:49, Rob Landley wrote:
> On Saturday 03 January 2009 06:38:15 Pavel Machek wrote:
> > +Ext3 expects disk/storage subsystem to behave sanely. On sanely
> > +behaving disk subsystem, data that have been successfully synced will
> > +stay on the disk. Sane means:
> > +
> > +* writes to media never fail. Even if disk returns error condition during
> > +  write, ext3 can't handle that correctly, because success on fsync was already
> > +  returned when data hit the journal.
> > +
> > +	   (Fortunately writes failing are very uncommon on disks, as they
> > +	   have spare sectors they use when write fails.)
> > +
> > +* either whole sector is correctly written or nothing is written during
> > +  powerfail.
> > +
> > +	   (Unfortunately, none of the cheap USB/SD flash cards I have seen behave
> > +	   like this, so they are unsuitable for ext3.
> 
> Want to document the granularity issues with flash, while you're at it?
> 
> An inherent problem with using flash as a normal block device is that the 
> flash erase size is bigger than most filesystem sector sizes.  So when you 
> request a write, it may erase and rewrite the next 64k, 128k, or even a couple 
> megabytes on the really _big_ ones.
> 
> If you lose power in the middle of that, ext3 won't notice that data in the 
> "sectors" _after_ the one you were trying to write to got trashed.
> 
> The flash filesystems take this into account as part of their wear levelling 
> stuff (they normally copy the entire chunk into a new chunk, leaving the old 
> one in place until it's no longer needed), but they need to query the device 
> to get the erase granularity in order to do that, which is why they don't work 
> on non-flash block devices.

Is there a Linux filesystem that can handle that? I know jffs2, but
that's unsuitable for stuff like USB thumb drives, right?

Does this sound like a fair summary?

Sector writes are atomic (ATOMIC-SECTORS)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Either whole sector is correctly written or nothing is written during
powerfail.

        Unfortunately, none of the cheap USB/SD flash cards I have seen
        behave like this, so they are unsuitable for all Linux filesystems
        I know.

                An inherent problem with using flash as a normal block
                device is that the flash erase size is bigger than
                most filesystem sector sizes.  So when you request a
                write, it may erase and rewrite the next 64k, 128k, or
                even a couple megabytes on the really _big_ ones.

                If you lose power in the middle of that, the filesystem
                won't notice that data in the "sectors" _after_ the
                one you were trying to write to got trashed.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: document ext3 requirements
  2009-01-04 22:55   ` Pavel Machek
@ 2009-01-05  0:16     ` david
  2009-01-05  9:38       ` Pavel Machek
  2009-01-05  1:50     ` Rob Landley
  2009-01-05  3:20     ` Martin K. Petersen
  2 siblings, 1 reply; 67+ messages in thread
From: david @ 2009-01-05  0:16 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Rob Landley, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc

On Sun, 4 Jan 2009, Pavel Machek wrote:

> On Sun 2009-01-04 13:49:49, Rob Landley wrote:
>> On Saturday 03 January 2009 06:38:15 Pavel Machek wrote:
>>> +Ext3 expects disk/storage subsystem to behave sanely. On sanely
>>> +behaving disk subsystem, data that have been successfully synced will
>>> +stay on the disk. Sane means:
>>> +
>>> +* writes to media never fail. Even if disk returns error condition during
>>> +  write, ext3 can't handle that correctly, because success on fsync was already
>>> +  returned when data hit the journal.
>>> +
>>> +	   (Fortunately writes failing are very uncommon on disks, as they
>>> +	   have spare sectors they use when write fails.)
>>> +
>>> +* either whole sector is correctly written or nothing is written during
>>> +  powerfail.
>>> +
>>> +	   (Unfortunately, none of the cheap USB/SD flash cards I have seen behave
>>> +	   like this, so they are unsuitable for ext3.
>>
>> Want to document the granularity issues with flash, while you're at it?
>>
>> An inherent problem with using flash as a normal block device is that the
>> flash erase size is bigger than most filesystem sector sizes.  So when you
>> request a write, it may erase and rewrite the next 64k, 128k, or even a couple
>> megabytes on the really _big_ ones.
>>
>> If you lose power in the middle of that, ext3 won't notice that data in the
>> "sectors" _after_ the one you were trying to write to got trashed.
>>
>> The flash filesystems take this into account as part of their wear levelling
>> stuff (they normally copy the entire chunk into a new chunk, leaving the old
>> one in place until it's no longer needed), but they need to query the device
>> to get the erase granularity in order to do that, which is why they don't work
>> on non-flash block devices.
>
> Is there linux filesystem that can handle that? I know jffs2, but
> that's unsuitable for stuff like USB thumb drives, right?
>
> Does this sound like a fair summary?
>
> Sector writes are atomic (ATOMIC-SECTORS)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Either whole sector is correctly written or nothing is written during
> powerfail.
>
>        Unfortunately, none of the cheap USB/SD flash cards I have seen
>        behave like this, so they are unsuitable for all Linux filesystems
>        I know.
>
>                An inherent problem with using flash as a normal block
>                device is that the flash erase size is bigger than
>                most filesystem sector sizes.  So when you request a
>                write, it may erase and rewrite the next 64k, 128k, or
>                even a couple megabytes on the really _big_ ones.
>
>                If you lose power in the middle of that, filesystem
>                won't notice that data in the "sectors" _after_ the
>                one you were trying to write to got trashed.

around, not after. the block you are reading could be in the middle or at 
the end of an eraseblock.
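
To make the "around" point concrete, a small Python sketch that computes which sectors share an erase block with the one being written. It assumes 512-byte sectors, a 128k erase block, and erase blocks aligned to multiples of the erase size counted from sector 0; the function name is made up and real devices may lay things out differently.

```python
def erase_block_sectors(sector, sector_size=512, erase_size=128 * 1024):
    """Return (first, last) sector numbers of the erase block holding `sector`."""
    per_block = erase_size // sector_size        # sectors per erase block
    first = (sector // per_block) * per_block    # round down to block start
    return first, first + per_block - 1

# Writing sector 300 puts sectors 256..511 at risk:
# 44 sectors before it and 211 after it.
print(erase_block_sectors(300))   # (256, 511)
```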

David Lang

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: document ext3 requirements
  2009-01-05  0:16     ` david
@ 2009-01-05  9:38       ` Pavel Machek
  0 siblings, 0 replies; 67+ messages in thread
From: Pavel Machek @ 2009-01-05  9:38 UTC (permalink / raw)
  To: david
  Cc: Rob Landley, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc


> >Sector writes are atomic (ATOMIC-SECTORS)
> >~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> >Either whole sector is correctly written or nothing is written during
> >powerfail.
> >
> >       Unfortunately, none of the cheap USB/SD flash cards I have seen
> >       behave like this, so they are unsuitable for all Linux filesystems
> >       I know.
> >
> >               An inherent problem with using flash as a normal block
> >               device is that the flash erase size is bigger than
> >               most filesystem sector sizes.  So when you request a
> >               write, it may erase and rewrite the next 64k, 128k, or
> >               even a couple megabytes on the really _big_ ones.
> >
> >               If you lose power in the middle of that, filesystem
> >               won't notice that data in the "sectors" _after_ the
> >               one you were trying to write to got trashed.
> 
> around, not after. the block you are reading could be in the middle or at 
> the end of an eraseblock.

Applied, thanks.

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: document ext3 requirements
  2009-01-04 22:55   ` Pavel Machek
  2009-01-05  0:16     ` david
@ 2009-01-05  1:50     ` Rob Landley
  2009-01-05  3:20     ` Martin K. Petersen
  2 siblings, 0 replies; 67+ messages in thread
From: Rob Landley @ 2009-01-05  1:50 UTC (permalink / raw)
  To: Pavel Machek
  Cc: kernel list, Andrew Morton, tytso, mtk.manpages, rdunlap,
	linux-doc

On Sunday 04 January 2009 16:55:45 Pavel Machek wrote:
> On Sun 2009-01-04 13:49:49, Rob Landley wrote:
> > On Saturday 03 January 2009 06:38:15 Pavel Machek wrote:
> > > +Ext3 expects disk/storage subsystem to behave sanely. On sanely
> > > +behaving disk subsystem, data that have been successfully synced will
> > > +stay on the disk. Sane means:
> > > +
> > > > +* writes to media never fail. Even if disk returns error condition during
> > > > +  write, ext3 can't handle that correctly, because success on fsync was already
> > > > +  returned when data hit the journal.
> > > > +
> > > > +	   (Fortunately writes failing are very uncommon on disks, as they
> > > > +	   have spare sectors they use when write fails.)
> > > > +
> > > > +* either whole sector is correctly written or nothing is written during
> > > > +  powerfail.
> > > > +
> > > > +	   (Unfortunately, none of the cheap USB/SD flash cards I have seen behave
> > > > +	   like this, so they are unsuitable for ext3.
> >
> > Want to document the granularity issues with flash, while you're at it?
> >
> > An inherent problem with using flash as a normal block device is that the
> > flash erase size is bigger than most filesystem sector sizes.  So when
> > you request a write, it may erase and rewrite the next 64k, 128k, or even
> > a couple megabytes on the really _big_ ones.
> >
> > If you lose power in the middle of that, ext3 won't notice that data in
> > the "sectors" _after_ the one you were trying to write to got trashed.
> >
> > The flash filesystems take this into account as part of their wear
> > levelling stuff (they normally copy the entire chunk into a new chunk,
> > leaving the old one in place until it's no longer needed), but they need
> > to query the device to get the erase granularity in order to do that,
> > which is why they don't work on non-flash block devices.
>
> Is there linux filesystem that can handle that? I know jffs2, but
> that's unsuitable for stuff like USB thumb drives, right?

Any of the flash filesystems should handle that.  The main problem with jffs2 
is it doesn't scale well to large device sizes.  UBIFS is supposed to scale 
much better, but I haven't played with it yet.

And the thing about USB thumb drives is they present as a normal block device, 
_not_ as flash, so you can't _query_ their erase granularity.  (It's like 
those hardware raid cards that wouldn't tell you they were striping and such 
so you had to figure out a well-performing layout all by yourself.)   They do 
it magically behind the scenes, and if the power goes out (or you yank the 
device out unexpectedly) and they haven't got a built-in capacitor or battery 
with enough power to complete their pending transaction, you're screwed.
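
As a rough illustration of the granularity problem, the sketch below computes
which logical sectors share an erase block with the sector being written, and
so could be trashed if power fails in the middle of the erase/rewrite. The
512-byte sector, the 128k erase block, the assumption that erase blocks are
aligned to a multiple of their own size, and the function name
sectors_at_risk() are all assumptions made for illustration. A real thumb
drive does not expose its erase geometry and may remap sectors internally,
which is exactly the problem.

```python
def sectors_at_risk(sector, sector_size=512, erase_block_size=128 * 1024):
    """Return the inclusive (first, last) sector range sharing an erase block.

    Assumes erase blocks start at multiples of erase_block_size and that
    logical sectors map linearly onto flash (no controller remapping).
    """
    if erase_block_size % sector_size != 0:
        raise ValueError("erase block must be a whole number of sectors")
    sectors_per_block = erase_block_size // sector_size
    # Round the sector number down to the start of its erase block.
    first = (sector // sectors_per_block) * sectors_per_block
    last = first + sectors_per_block - 1
    return first, last


first, last = sectors_at_risk(1000)
print(f"writing sector 1000 puts sectors {first}..{last} at risk")
```

With these assumed sizes a single-sector write exposes 256 sectors, most of
which the filesystem never asked to touch.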

Plus they do horrible wear levelling, the lot of 'em.  Read Val Henson's 
livejournal entry about it: http://valhenson.livejournal.com/25228.html

There was also a marvelous thread Linus participated in on some hardware 
industry web message board, but I have no idea where it's gone...

> Does this sound like a fair summary?

See Ted's comment.  The summary's fine, the question is where to put this sort 
of thing...

>                 If you lose power in the middle of that, filesystem
>                 won't notice that data in the "sectors" _after_ the
>                 one you were trying to write to got trashed.

Well, the journal won't notice.  An e2fsck will notice huge swaths of missing 
metadata, but won't be able to do anything about it.  (And if what got zapped 
was file _contents_ rather than metadata, you're on your own finding it.  Fun, 
isn't it?)

> 									Pavel

Rob


* Re: document ext3 requirements
  2009-01-04 22:55   ` Pavel Machek
  2009-01-05  0:16     ` david
  2009-01-05  1:50     ` Rob Landley
@ 2009-01-05  3:20     ` Martin K. Petersen
  2009-01-05  9:45       ` Pavel Machek
  2 siblings, 1 reply; 67+ messages in thread
From: Martin K. Petersen @ 2009-01-05  3:20 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Rob Landley, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc

>>>>> "Pavel" == Pavel Machek <pavel@suse.cz> writes:

Pavel> Does this sound like a fair summary?

Pavel> Sector writes are atomic (ATOMIC-SECTORS)
Pavel> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I'd just like to point out that the all-or-nothing hardware sector
atomicity thing is -- to a large extent -- a myth.

It is mostly true on SCSI class devices because various UNIX, RAID array
and database vendors have spent many years leaning very hard on the
drive manufacturers to make it so.

But it's not a hard guarantee, you can't get it in writing, and it's not
in any of the standards.  Hybrid drives with flash had potential to
close that particular loophole but those appear to be dead in the water.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: document ext3 requirements
  2009-01-05  3:20     ` Martin K. Petersen
@ 2009-01-05  9:45       ` Pavel Machek
  2009-01-05 11:28         ` Alan Cox
  2009-01-05 19:15         ` Martin K. Petersen
  0 siblings, 2 replies; 67+ messages in thread
From: Pavel Machek @ 2009-01-05  9:45 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Rob Landley, kernel list, Andrew Morton, tytso, mtk.manpages,
	rdunlap, linux-doc

> >>>>> "Pavel" == Pavel Machek <pavel@suse.cz> writes:
> 
> Pavel> Does this sound like a fair summary?
> 
> Pavel> Sector writes are atomic (ATOMIC-SECTORS)
> Pavel> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> I'd just like to point out that the all-or-nothing hardware sector
> atomicity thing is -- to a large extent -- a myth.

It is a myth that linux filesystems depend on for safe operation :-(.

> It is mostly true on SCSI class devices because various UNIX, RAID array
> and database vendors have spent many years leaning very hard on the
> drive manufacturers to make it so.
> 
> But it's not a hard guarantee, you can't get it in writing, and it's not
> in any of the standards.  Hybrid drives with flash had potential to
> close that particular loophole but those appear to be dead in the water.

So "in practice it works but vendors will not guarantee that"?

How true is it for normal SATA drives? Are there some tests I can
just run on a machine, powercycle it a few times, and it tells me if my
disk is non-ATOMIC-SECTORS?
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: document ext3 requirements
  2009-01-05  9:45       ` Pavel Machek
@ 2009-01-05 11:28         ` Alan Cox
  2009-01-05 19:15         ` Martin K. Petersen
  1 sibling, 0 replies; 67+ messages in thread
From: Alan Cox @ 2009-01-05 11:28 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Martin K. Petersen, Rob Landley, kernel list, Andrew Morton,
	tytso, mtk.manpages, rdunlap, linux-doc

> How true is it for normal SATA drives? Are there some tests I can
> just run on a machine, powercycle it a few times, and it tells me if my
> disk is non-ATOMIC-SECTORS?

No.

And even if it did, writes to one sector can damage another. The
mathematical certainty stuff lives only in the world of maths. In the real
world everything is probabilities.

Alan


* Re: document ext3 requirements
  2009-01-05  9:45       ` Pavel Machek
  2009-01-05 11:28         ` Alan Cox
@ 2009-01-05 19:15         ` Martin K. Petersen
  2009-01-05 20:19           ` Theodore Tso
  1 sibling, 1 reply; 67+ messages in thread
From: Martin K. Petersen @ 2009-01-05 19:15 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Martin K. Petersen, Rob Landley, kernel list, Andrew Morton,
	tytso, mtk.manpages, rdunlap, linux-doc

>>>>> "Pavel" == Pavel Machek <pavel@suse.cz> writes:

>> It is mostly true on SCSI class devices because various UNIX, RAID
>> array and database vendors have spent many years leaning very hard on
>> the drive manufacturers to make it so.
>> 
>> But it's not a hard guarantee, you can't get it in writing, and it's
>> not in any of the standards.  Hybrid drives with flash had potential
>> to close that particular loophole but those appear to be dead in the
>> water.

Pavel> So "in practice it works but vendors will not guarantee that"?

It works some of the time.  But in reality if you yank power halfway
during a write operation the end result is undefined.

The saving grace for normal users is that the potential corruption is
limited to a couple of sectors.

The current suck of flash SSDs is that the erase block size amplifies
this problem by at least one order of magnitude, often two.  I have a
couple of SSDs here that will leave my filesystem in shambles every time
the machine crashes.  I quickly got tired of reinstalling Fedora several
times per week so now my main machine is back to spinning media.

The people that truly and deeply care about this type of write atomicity
(i.e. enterprises) deploy disk arrays that will do the right thing in
face of an error.  This involves NVRAM, mirrored caches, uninterruptible
power supplies, etc.  Brute force if you will.

High-end arrays even give you atomicity at a bigger granularity such as
filesystem or database blocks.  On some storage you can say "this LUN is
used for an Oracle database that always writes in multiples of 8KB" and
the array will guarantee that each 8KB block of the I/O is written in
its entirety or not at all.  Some arrays even allow you to verify Oracle
logical block checksums to ensure that the I/O is intact and internally
consistent.
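
To illustrate how a per-block checksum lets a reader detect a block that was
only partly written, here is a minimal sketch that seals each 8KB block with
a CRC32 and verifies it on read. This is not Oracle's on-disk format and not
what any particular array implements; the block layout (checksum in the last
4 bytes), the function names and the choice of zlib.crc32 are assumptions
chosen for illustration.

```python
import struct
import zlib

BLOCK_SIZE = 8 * 1024
PAYLOAD_SIZE = BLOCK_SIZE - 4  # the last 4 bytes hold the CRC32


def seal_block(payload: bytes) -> bytes:
    """Append a little-endian CRC32 of the payload to form a full block."""
    if len(payload) != PAYLOAD_SIZE:
        raise ValueError("payload must be exactly PAYLOAD_SIZE bytes")
    return payload + struct.pack("<I", zlib.crc32(payload))


def block_is_intact(block: bytes) -> bool:
    """Return True if the stored CRC32 matches the payload."""
    if len(block) != BLOCK_SIZE:
        return False
    payload = block[:PAYLOAD_SIZE]
    (stored,) = struct.unpack("<I", block[PAYLOAD_SIZE:])
    return zlib.crc32(payload) == stored


# Simulate a torn write: the first 4KB holds new data, the rest is stale.
old = seal_block(b"A" * PAYLOAD_SIZE)
new = seal_block(b"B" * PAYLOAD_SIZE)
torn = new[:4096] + old[4096:]

print("complete write intact:", block_is_intact(new))
print("torn write intact:    ", block_is_intact(torn))
```

A checksum like this only detects the torn block after the fact. It does not
provide the all-or-nothing guarantee itself, which still has to come from the
storage.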

I have been bugging storage vendors about a per-I/O write atomicity
setting for a while.  But it really messes up their pipelining so they
aren't keen on the idea.  We may be able to get some of it fixed as a
side-effect of the DIF bits vs. the impending switch to 4KB sectors,
though.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: document ext3 requirements
  2009-01-05 19:15         ` Martin K. Petersen
@ 2009-01-05 20:19           ` Theodore Tso
  0 siblings, 0 replies; 67+ messages in thread
From: Theodore Tso @ 2009-01-05 20:19 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Pavel Machek, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc

On Mon, Jan 05, 2009 at 02:15:44PM -0500, Martin K. Petersen wrote:
> 
> It works some of the time.  But in reality if you yank power halfway
> during a write operation the end result is undefined.
> 
> The saving grace for normal users is that the potential corruption is
> limited to a couple of sectors.

A few years ago it was asserted to me that the internal block size for
spinning magnetic media was around 32k.  So if the hard drive doesn't
have enough of a capacitor or other energy reserve to complete its
internal read-modify-write cycle, attempts to read the 32k chunk of
disk could result in hard ECC failures that would cause the blocks in
question to all return uncorrectable read errors when they are
accessed.

Of course, if the memory goes south first, and you're in the middle of
streaming a 128k update to the inode the filesystem, and the power
fails, and the memory starts returning garbage during the DMA
operation, you may have much bigger problems.  :-)

So it's probably more than "a couple of sectors"....

> The current suck of flash SSDs is that the erase block size amplifies
> this problem by at least one order of magnitude, often two.  I have a
> couple of SSDs here that will leave my filesystem in shambles every time
> the machine crashes.  I quickly got tired of reinstalling Fedora several
> times per week so now my main machine is back to spinning media.

The erase block size is typically 1 to 4 megabytes, from my
understanding.  So yeah, that's easily 1-2 orders of magnitude.  Worse
yet, flash's sequential streaming write speeds are much slower than
hard drive's (anywhere from a factor of 3 to 12 depending on how
cheap/trashy the flash drive happens to be), so that opens the time
window even further, by possibly as much as another order of magnitude.
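
The figures above can be turned into a back-of-the-envelope estimate. The
sketch below uses only numbers quoted in this thread (a 32k internal block
for spinning media, 1 to 4 megabyte erase blocks, writes slower by a factor
of 3 to 12). Simply multiplying the size ratio by the slowdown is a
simplifying assumption, and the names are made up for illustration.

```python
import math

KIB = 1024
MIB = 1024 * KIB

HDD_INTERNAL_BLOCK = 32 * KIB           # "around 32k", as asserted above
ERASE_BLOCK_RANGE = (1 * MIB, 4 * MIB)  # "typically 1 to 4 megabytes"
WRITE_SLOWDOWN_RANGE = (3, 12)          # "a factor of 3 to 12"


def size_ratio(erase_block, hdd_block=HDD_INTERNAL_BLOCK):
    """How many HDD-sized internal blocks fit in one flash erase block."""
    return erase_block // hdd_block


def exposure_factor(erase_block, slowdown):
    """Size ratio scaled by the longer time window of a slower write."""
    return size_ratio(erase_block) * slowdown


# Pair the best case with the best case and the worst with the worst.
for erase_block, slowdown in zip(ERASE_BLOCK_RANGE, WRITE_SLOWDOWN_RANGE):
    ratio = size_ratio(erase_block)
    combined = exposure_factor(erase_block, slowdown)
    print(f"{erase_block // MIB} MiB erase block: {ratio}x by size, "
          f"{combined}x with a {slowdown}x slower write "
          f"(~{math.log10(combined):.1f} orders of magnitude)")
```

Under these assumptions the size ratio alone is 32x to 128x, and the combined
factor runs from 96x to 1536x, i.e. roughly two to three orders of magnitude.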

I also suspect that HDD manufacturers have learned various tricks (due
to enterprise storage/database vendors leaning on them) to make the
drives appear more atomic in the face of hard drive errors, and also,
in Pavel's case, as I recall he was using the card in a laptop where
the SD card protruded slightly from the laptop case, and it was very
easy for it to get dislodged, meaning that power failures during
writes were even more likely than you would expect with a fixed HDD or
SSD which is secured into place using screws or other more reliable
mounting hardware.

Put all of this together, given that Pavel's Really Trashy 32GB SD was
probably the full 3 orders of magnitude worse than traditional HDD,
and he was having many more failures due to physical mounting issues,
it's not surprising that most people haven't seen problems with
traditional HDD's, even though none of this is guaranteed by the hard drive
vendors.

> The people that truly and deeply care about this type of write atomicity
> (i.e. enterprises) deploy disk arrays that will do the right thing in
> face of an error.  This involves NVRAM, mirrored caches, uninterruptible
> power supplies, etc.  Brute force if you will.

Don't forget non-cheesy mounting options so an accidental brush
against the side of the unit doesn't cause the hard drive to become
disconnected from system and suffer a power drop.  I guess that gets
filed under "Brute force" as well.  :-)

							- Ted

P.S.  I feel obliged to point out that in my Lenovo X61s, the SD card
is flush with the laptop case when inserted, and I've never had a
problem with the SD card prematurely ejected during operation.   :-)


Thread overview: 67+ messages
2009-01-03 12:38 document ext3 requirements Pavel Machek
2009-01-03 21:17 ` Martin MOKREJŠ
2009-01-03 22:06   ` Pavel Machek
2009-01-03 22:17   ` Duane Griffin
2009-01-03 22:29     ` Pavel Machek
2009-01-03 23:01       ` Martin MOKREJŠ
2009-01-03 23:38         ` Duane Griffin
2009-01-03 23:50           ` Martin MOKREJŠ
2009-01-03 23:58             ` Robert Hancock
2009-01-04  0:08               ` Martin MOKREJŠ
2009-01-04 21:49               ` Ingo Oeser
2009-01-04  0:00             ` Duane Griffin
2009-01-04  0:11               ` Martin MOKREJŠ
2009-01-04  0:41                 ` Duane Griffin
2009-01-04  3:52                   ` Valdis.Kletnieks
2009-01-04 14:24                     ` Duane Griffin
2009-01-04 18:40                       ` Theodore Tso
2009-01-04 19:21                         ` Geert Uytterhoeven
2009-01-04 19:36                           ` Theodore Tso
2009-01-04 19:51                             ` Duane Griffin
2009-01-04 21:55                               ` Theodore Tso
2009-01-04 22:06                                 ` Duane Griffin
2009-01-04 22:42                           ` Bron Gondwana
2009-01-05  3:22                           ` Rob Landley
2009-01-04  0:19         ` Pavel Machek
2009-01-05  2:55           ` Rob Landley
2009-01-04 19:56         ` Rob Landley
2009-01-05 19:16           ` Theodore Tso
2009-01-06 19:20             ` Rob Landley
2009-01-06 10:08         ` Matthias Andree
2009-01-06 15:23           ` Theodore Tso
2009-01-03 23:12       ` Duane Griffin
2009-01-06 10:06       ` Matthias Andree
2009-01-04  2:32 ` Theodore Tso
2009-01-04 22:33   ` Pavel Machek
2009-01-04 22:34   ` [patch] document ext3 a bit better Pavel Machek
2009-01-05 14:57     ` Theodore Tso
2009-01-06  9:21       ` Pavel Machek
2009-01-09 23:24         ` Jiri Kosina
2009-01-09 23:36           ` Randy Dunlap
2009-01-09 23:47             ` Jiri Kosina
2009-01-04 13:35 ` document ext3 requirements Alexander E. Patrakov
2009-01-04 13:53   ` Valdis.Kletnieks
2009-01-04 18:21   ` Michael Tokarev
2009-01-04 18:38   ` Theodore Tso
2009-01-04 22:37     ` Pavel Machek
2009-01-04 23:58       ` Theodore Tso
2009-01-05 11:43     ` Alan Cox
2009-01-07 11:59       ` Rob Landley
2009-01-04 20:10   ` Pavel Machek
2009-01-04 19:49 ` Rob Landley
2009-01-04 22:06   ` Theodore Tso
2009-01-04 22:25     ` Pavel Machek
2009-01-04 23:00     ` [patch] " Pavel Machek
2009-01-05  2:42       ` Rob Landley
2009-01-05  9:54         ` Pavel Machek
2009-01-04 23:07     ` Pavel Machek
2009-01-05  1:38     ` Rob Landley
2009-01-04 22:55   ` Pavel Machek
2009-01-05  0:16     ` david
2009-01-05  9:38       ` Pavel Machek
2009-01-05  1:50     ` Rob Landley
2009-01-05  3:20     ` Martin K. Petersen
2009-01-05  9:45       ` Pavel Machek
2009-01-05 11:28         ` Alan Cox
2009-01-05 19:15         ` Martin K. Petersen
2009-01-05 20:19           ` Theodore Tso
