* EXT intent logging
@ 2004-08-06 4:55 Buddy Lumpkin
2004-08-06 9:57 ` Daniel Pittman
2004-08-06 19:56 ` Theodore Ts'o
0 siblings, 2 replies; 8+ messages in thread
From: Buddy Lumpkin @ 2004-08-06 4:55 UTC (permalink / raw)
To: linux-kernel
I recently moved from a Sun/Solaris environment to a mostly Linux
environment.

A large NFS server went down recently and as it rebooted, fsck ran for a
while before the data volumes could be mounted. I noticed the filesystem
was ext3 and asked, is journaling disabled? Why on earth is fsck running
at all? The admin assured me this is quite normal for ext3 and after a
few minutes, the system was brought back online.

I looked at the configuration and it turns out the system was mounted
DATA=ORDERED. That name ordered sounded to me like it should do the kind
of intent logging that I am accustomed to on UFS and VXFS. I was very
surprised to read that ext3 updates the standard data/metadata blocks
prior to updating the journal. While I'm sure this achieves what the
snippet from the ext3 FAQ says below: "this mode guarantees that after a
crash, files will never contain stale data blocks from old files", I
don't see how fsck time can be eliminated entirely with this journaling
method.

To eliminate fsck on large filesystems, wouldn't you have to update the
journal first, then update the data blocks? This way in the event of a
crash, the last entries in the log would represent the last I/O
operations that were "intended" and those blocks could be inspected for
consistency.

This of course is my non-kernel hacker understanding of how this works,
but I can say one thing. With UFS mounted with -o logging, I can start a
ton of reads and writes and just kill the power on a system and not
expect to see any delay when the system comes back up.

Of course, UFS logging does not log data, only metadata (as the
data=ordered and data=writeback options do).

Also, vxfs, which behaves more like data=journal I believe, also spends
very little time replaying the journal after a nasty crash.

We wanted the journal to be updated first, but we couldn't understand
why we had to opt for data journaling to accomplish this. The
unfortunate thing is, we have seen corruption as a result of the
data=journal option.

Could someone explain why there isn't an option in ext3 to only log
metadata, but completely avoid fsck by updating the log before the data
blocks?

And I'm sure I don't need to ask anyone to correct me if I am misguided
in my thinking. I have found on lkml that kind of guidance usually comes
for free.

--Buddy
---------------------
"mount -o data=ordered"
Only journals metadata changes, but data updates are flushed to
disk before any transactions commit. Data writes are not atomic
but this mode still guarantees that after a crash, files will
never contain stale data blocks from old files.
* Re: EXT intent logging
2004-08-06 4:55 EXT intent logging Buddy Lumpkin
@ 2004-08-06 9:57 ` Daniel Pittman
2004-08-06 13:22 ` Doug McNaught
2004-08-06 19:56 ` Theodore Ts'o
1 sibling, 1 reply; 8+ messages in thread
From: Daniel Pittman @ 2004-08-06 9:57 UTC (permalink / raw)
To: linux-kernel
On 6 Aug 2004, Buddy Lumpkin wrote:
> I recently moved from a Sun/Solaris environment to a mostly linux environment
>
> A large NFS server went down recently and as it rebooted, fsck ran for
> a while before the data volumes could be mounted. I noticed the
> filesystem was ext3 and asked, is journaling disabled? Why on earth is
> fsck running at all? The admin assured me this is quite normal for
> ext3 and after a few minutes, the system was brought back online.
What is normal is that ext3 will perform an *occasional* fsck - by
default, once a month or every thirty-odd mounts - to catch any
corruption that has been missed by the journaling.
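As a rough illustration of the periodic check described above, the decision can be thought of as two independent triggers: a mount counter and a time interval. The sketch below is not e2fsck's code; the function name and the 30-mount / 30-day figures are assumptions chosen to mirror the wording above, and a real filesystem may be configured differently.

```python
from datetime import datetime, timedelta

def periodic_check_due(mount_count, max_mount_count, last_check, check_interval, now):
    """Return True when either trigger says a precautionary fsck is due.

    A limit of 0 (or a zero interval) is treated as 'trigger disabled'.
    """
    by_mounts = max_mount_count > 0 and mount_count >= max_mount_count
    by_time = check_interval > timedelta(0) and now - last_check >= check_interval
    return by_mounts or by_time

# Hypothetical values: 30 mounts or 30 days, whichever comes first.
now = datetime(2004, 8, 6)
recently = now - timedelta(days=2)
print(periodic_check_due(31, 30, recently, timedelta(days=30), now))  # mount limit hit
print(periodic_check_due(3, 30, recently, timedelta(days=30), now))   # neither trigger
```

Either trigger on its own is enough, which is why a server that is rarely rebooted can still get a full check on the first boot after a long uptime.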
Also, the fsck will replay the journal outside the kernel if possible,
so you may have witnessed the journal replay.
This is done because it allows multiple filesystems to replay the
journal in parallel rather than sequentially as doing it in-kernel would
require.
Regards,
Daniel
--
If you ever reach total enlightenment while you're drinking a beer,
I bet it makes beer shoot out your nose.
-- Jack Handy
* Re: EXT intent logging
2004-08-06 9:57 ` Daniel Pittman
@ 2004-08-06 13:22 ` Doug McNaught
2004-08-06 16:36 ` Buddy Lumpkin
0 siblings, 1 reply; 8+ messages in thread
From: Doug McNaught @ 2004-08-06 13:22 UTC (permalink / raw)
To: Daniel Pittman; +Cc: linux-kernel, Buddy Lumpkin
Daniel Pittman <daniel@rimspace.net> writes:
> What is normal is that ext3 will perform an *occasional* fsck - by
> default, once a month or every thirty-odd mounts - to catch any
> corruption that has been missed by the journaling.
And if you don't want this to happen, you can use 'tune2fs' to turn it
off, and rely entirely on journal replays.
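A minimal sketch of what turning the periodic checks off might look like, assuming the standard tune2fs options `-c` (maximum mount count) and `-i` (interval between checks), where a value of 0 disables the corresponding trigger. The helper name and the device path are placeholders; the command is only built here, not run, since running it requires root and a real ext2/ext3 device.

```python
def disable_periodic_checks_cmd(device):
    """Build (but do not run) a tune2fs command line that disables both
    the mount-count trigger (-c 0) and the time-interval trigger (-i 0)."""
    return ["tune2fs", "-c", "0", "-i", "0", device]

# /dev/sdb1 is a hypothetical device name used only for illustration.
print(" ".join(disable_periodic_checks_cmd("/dev/sdb1")))
```

Building the command as an argument list keeps it ready to hand to `subprocess.run` without shell quoting issues, should one choose to execute it.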
-Doug
--
Let us cross over the river, and rest under the shade of the trees.
--T. J. Jackson, 1863
* RE: EXT intent logging
2004-08-06 13:22 ` Doug McNaught
@ 2004-08-06 16:36 ` Buddy Lumpkin
2004-08-06 18:46 ` Bernd Eckenfels
0 siblings, 1 reply; 8+ messages in thread
From: Buddy Lumpkin @ 2004-08-06 16:36 UTC (permalink / raw)
To: 'Doug McNaught', 'Daniel Pittman'; +Cc: linux-kernel
But if it doesn't do writes to the journal first, how does it identify
transactions that were "in flight" when the system went down to reverse
them? How do you catch a partial update to an inode, for instance?
--Buddy
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
* Re: EXT intent logging
2004-08-06 16:36 ` Buddy Lumpkin
@ 2004-08-06 18:46 ` Bernd Eckenfels
0 siblings, 0 replies; 8+ messages in thread
From: Bernd Eckenfels @ 2004-08-06 18:46 UTC (permalink / raw)
To: linux-kernel
In article <S268180AbUHFQey/20040806163555Z+829@vger.kernel.org> you wrote:
> But if it doesn't do writes to the journal first, how does it identify
> transactions that were "in flight" when the system went down to reverse
> them? How do you catch a partial update to an inode, for instance?
The journal is about metadata; if you want data journaling, try data=journal.
Gruss
Bernd
--
eckes privat - http://www.eckes.org/
Project Freefire - http://www.freefire.org/
* Re: EXT intent logging
2004-08-06 4:55 EXT intent logging Buddy Lumpkin
2004-08-06 9:57 ` Daniel Pittman
@ 2004-08-06 19:56 ` Theodore Ts'o
2004-08-07 4:50 ` Buddy Lumpkin
2004-08-08 1:54 ` Thomas Zimmerman
1 sibling, 2 replies; 8+ messages in thread
From: Theodore Ts'o @ 2004-08-06 19:56 UTC (permalink / raw)
To: Buddy Lumpkin; +Cc: linux-kernel
On Thu, Aug 05, 2004 at 09:55:28PM -0700, Buddy Lumpkin wrote:
> A large NFS server went down recently and as it rebooted, fsck ran
> for a while before the data volumes could be mounted. I noticed the
> filesystem was ext3 and asked, is journaling disabled? Why on earth
> is fsck running at all? The admin assured me this is quite normal
> for ext3 and after a few minutes, the system was brought back
> online.
Fsck replays the journal for ext3 filesystems that were not cleanly
unmounted. That is, the metadata (and possibly data) blocks in the
journal are written to the correct location on disk in order to make
the filesystem consistent.
> I looked at the configuration and it turns out the system was
> mounted DATA=ORDERED. That name ordered sounded to me like it
> should do the kind of intent logging that I am accustomed to on UFS
> and VXFS. I was very surprised to read that ext3 updates the
> standard data/metadata blocks prior to updating the journal.
Incorrect. Only data blocks are forced out to disk before the changes
to metadata blocks (which are written to the journal first) are
committed. The changes to the metadata blocks are not written to disk
(outside of the journal) until after the transaction is committed.
> To eliminate fsck on large filesystems, wouldn't you have to update
> the journal first, then update the data blocks? This way in the
> event of a crash, the last entries in the log would represent the
> last I/O operations that were "intended" and those blocks could be
> inspected for consistency.
See above. The metadata changes are written out to the journal first,
but we want to make sure that, before those changes are committed, the
data blocks pointed to by the metadata blocks are valid. If you mount
with -o data=writeback, then this constraint on data blocks is not
enforced. This still eliminates the need for a full fsck, but even
though the filesystem is consistent after the journal is replayed, the
metadata blocks may be pointing at unwritten data blocks which only
contain garbage.
> Could someone explain why there isn't an option in ext3 to only log
> metadata, but completely avoid fsck by updating the log before the
> data blocks?
"mount -o data=writeback" only logs metadata. This seems to be what
you are requesting.
"mount -o data=journal" logs metadata blocks and data blocks into the
journal first, and then after the transaction commits, the metadata
and data blocks are written to their final location on disk. The
problem with this is that your write bandwidth is cut in half,
since all block writes get written twice to disk --- once to the
journal, and once to the final location on disk.
"mount -o data=ordered" simply defers the transaction commit until the
data blocks are written to their final location on disk. If the data
writes do not make it onto the disk before the system crashes, the
transaction never commits and the metadata changes won't get replayed
on filesystem recovery.
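The difference between the three modes can be sketched as a toy crash simulation. This is not how ext3 is implemented; the step names, the single-block "file append" and the helper functions are all assumptions made for illustration. Each mode is modelled as an ordered list of I/O steps, the machine "crashes" after some number of them, and recovery replays the journal only if a commit record made it to disk.

```python
STALE = "stale"  # whatever an old file left in the block
NEW = "new"      # the contents the application actually wrote

# Order of I/O steps for one append of a single data block (toy model).
STEPS = {
    "writeback": ["journal_meta", "commit", "data_in_place", "checkpoint"],
    "ordered":   ["journal_meta", "data_in_place", "commit", "checkpoint"],
    "journal":   ["journal_meta", "journal_data", "commit", "checkpoint"],
}

def apply_journal(journal, disk):
    """Copy journaled blocks to their final on-disk locations."""
    if journal["meta"]:
        disk["inode_points_to_block"] = True
    if journal["data"]:
        disk["block"] = NEW

def crash_and_recover(mode, steps_done):
    """Run the first steps_done steps, crash, then replay the journal."""
    disk = {"block": STALE, "inode_points_to_block": False}
    journal = {"meta": False, "data": False, "committed": False}
    for step in STEPS[mode][:steps_done]:
        if step == "journal_meta":
            journal["meta"] = True
        elif step == "journal_data":
            journal["data"] = True
        elif step == "commit":
            journal["committed"] = True
        elif step == "data_in_place":
            disk["block"] = NEW
        elif step == "checkpoint":
            apply_journal(journal, disk)
    if journal["committed"]:  # uncommitted transactions are discarded
        apply_journal(journal, disk)
    return disk

def exposes_stale_data(disk):
    return disk["inode_points_to_block"] and disk["block"] == STALE

for mode in STEPS:
    unsafe = [n for n in range(5) if exposes_stale_data(crash_and_recover(mode, n))]
    print(mode, "crash points exposing stale data:", unsafe)
```

In this model none of the modes needs a full fsck, because metadata only ever reaches its final location through a committed transaction. The only mode with a crash point that leaves the inode pointing at stale contents is data=writeback, where the commit can land before the data block does.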
In all cases, however, a full fsck is not needed. The
reason why it is useful to have e2fsck do the journal replay, as
opposed to letting the kernel do it when you try to mount the
filesystem, is that if you have multiple disk drives, replaying the
journal in userspace allows multiple filesystems to be recovered in
parallel, instead of one filesystem at a time. Linux's fsck is
intelligent and will spawn off multiple copies of e2fsck, one for
each filesystem, so long as they are on separate disk spindles.
(Running two copies of e2fsck on different partitions on the same disk
drive is pointless, since the two e2fsck processes simply thrash the
disk heads and get in the way of each other.)
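The "one e2fsck per spindle" idea can be sketched as a small scheduling function. This is a simplification and not the logic of the real fsck wrapper; the function name, the lock-step batches and the device names are assumptions for illustration. Filesystems that share a physical disk are placed in different batches, so that within a batch every e2fsck process has a disk to itself.

```python
from collections import defaultdict
from itertools import zip_longest

def plan_batches(fs_to_disk):
    """Group filesystems into batches that can run in parallel.

    fs_to_disk maps a filesystem device to the physical disk it lives on.
    Each batch contains at most one filesystem per physical disk.
    """
    per_disk = defaultdict(list)
    for fs, disk in fs_to_disk.items():
        per_disk[disk].append(fs)
    batches = []
    # Take the i-th filesystem of every disk for batch i.
    for group in zip_longest(*per_disk.values()):
        batches.append([fs for fs in group if fs is not None])
    return batches

# Hypothetical layout: two partitions on sda, one each on sdb and sdc.
layout = {"/dev/sda1": "sda", "/dev/sda2": "sda", "/dev/sdb1": "sdb", "/dev/sdc1": "sdc"}
for i, batch in enumerate(plan_batches(layout), start=1):
    print(f"batch {i}: {batch}")
```

With this layout the first batch covers three disks at once, and `/dev/sda2` waits for the second batch rather than competing with `/dev/sda1` for the same disk heads.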
Hope this helps,
- Ted
* RE: EXT intent logging
2004-08-06 19:56 ` Theodore Ts'o
@ 2004-08-07 4:50 ` Buddy Lumpkin
2004-08-08 1:54 ` Thomas Zimmerman
1 sibling, 0 replies; 8+ messages in thread
From: Buddy Lumpkin @ 2004-08-07 4:50 UTC (permalink / raw)
To: 'Theodore Ts'o'; +Cc: linux-kernel
Thanks Ted,
This clears up a lot of bad assumptions I made while reading vague
descriptions about the different ext3 journal options in miscellaneous
places.
--Buddy
* Re: EXT intent logging
2004-08-06 19:56 ` Theodore Ts'o
2004-08-07 4:50 ` Buddy Lumpkin
@ 2004-08-08 1:54 ` Thomas Zimmerman
1 sibling, 0 replies; 8+ messages in thread
From: Thomas Zimmerman @ 2004-08-08 1:54 UTC (permalink / raw)
To: linux-kernel; +Cc: Theodore Ts'o, Buddy Lumpkin
On 06-Aug 03:56, Theodore Ts'o wrote:
> "mount -o data=journal" logs metadata blocks and data blocks into
> journal first, and then after the transaction commits, the metadata
> and data blocks are written to their final location on disk. The
> problem with this is that all your write bandwidth is cut in half
> since all block writes get written twice to disk --- once to the
> journal, and once to the final location on disk.
While you do halve your write bandwidth, there aren't any _seeks_ while
writing to the journal. For I/O workloads where a sync allows you to
move on to another job, this can speed up your workload. Mail delivery
on an old laptop disk was ~2x as fast using ext3 data=journal...
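A back-of-the-envelope model of why fewer seeks can outweigh doubled writes when the workload waits on sync. Every number below is an illustrative assumption rather than a measurement, and the function is a toy cost model, not a benchmark. It only counts the work that must finish before sync returns; in data=journal mode the second copy to the final location happens later, in the background.

```python
def sync_latency_ms(seeks, kib_written, seek_ms, kib_per_ms):
    """Toy cost model: each head movement costs seek_ms, and the
    transfer itself costs kib_written / kib_per_ms."""
    return seeks * seek_ms + kib_written / kib_per_ms

# Illustrative figures only (assumptions, not measurements).
SEEK_MS, KIB_PER_MS = 12.0, 20.0
MESSAGE_KIB, METADATA_KIB = 8, 4

# data=ordered: seek to the data block's final location, then back to the journal.
ordered = sync_latency_ms(2, MESSAGE_KIB + METADATA_KIB, SEEK_MS, KIB_PER_MS)
# data=journal: data and metadata are appended to the journal sequentially.
journal = sync_latency_ms(0, MESSAGE_KIB + METADATA_KIB, SEEK_MS, KIB_PER_MS)

print(f"ordered: {ordered:.1f} ms, journal: {journal:.1f} ms until sync returns")
```

The model ignores rotational latency, caching and the later checkpoint traffic, so the ratio it produces should not be read as a prediction; it only shows that for small synchronous writes the seek term dominates.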
Thomas