* EXT intent logging
From: Buddy Lumpkin @ 2004-08-06 4:55 UTC
To: linux-kernel

I recently moved from a Sun/Solaris environment to a mostly Linux
environment.

A large NFS server went down recently and as it rebooted, fsck ran for a
while before the data volumes could be mounted. I noticed the filesystem
was ext3 and asked, is journaling disabled? Why on earth is fsck running
at all? The admin assured me this is quite normal for ext3 and after a
few minutes, the system was brought back online.

I looked at the configuration and it turns out the system was mounted
DATA=ORDERED. The name "ordered" sounded to me like it should do the kind
of intent logging that I am accustomed to on UFS and VxFS. I was very
surprised to read that ext3 updates the standard data/metadata blocks
prior to updating the journal. While I'm sure this achieves what the
snippet from the ext3 FAQ below says ("this mode guarantees that after a
crash, files will never contain stale data blocks from old files"), I
don't see how fsck time can be eliminated entirely with this journal
method.

To eliminate fsck on large filesystems, wouldn't you have to update the
journal first, then update the data blocks? This way, in the event of a
crash, the last entries in the log would represent the last I/O
operations that were "intended" and those blocks could be inspected for
consistency.

This of course is my non-kernel-hacker understanding of how this works,
but I can say one thing: with UFS mounted with -o logging, I can start a
ton of reads and writes, kill the power on a system, and not expect to
see any delay when the system comes back up.

Of course, UFS logging does not log data, only metadata (as the
data=ordered and data=writeback options do).

Also, VxFS, which I believe behaves more like data=journal, spends very
little time replaying the journal after a nasty crash.

We wanted the journal to be updated first, but we couldn't understand why
we had to opt for data journaling to accomplish this. The unfortunate
thing is, we have seen corruption as a result of the data=journal option.

Could someone explain why there isn't an option in ext3 to only log
metadata, but completely avoid fsck by updating the log before the data
blocks?

And I'm sure I don't need to ask anyone to correct me if I am misguided
in my thinking. I have found on lkml that kind of guidance usually comes
for free.

--Buddy
---------------------
"mount -o data=ordered"
Only journals metadata changes, but data updates are flushed to
disk before any transactions commit. Data writes are not atomic
but this mode still guarantees that after a crash, files will
never contain stale data blocks from old files.

* Re: EXT intent logging
From: Daniel Pittman @ 2004-08-06 9:57 UTC
To: linux-kernel

On 6 Aug 2004, Buddy Lumpkin wrote:
> I recently moved from a Sun/Solaris environment to a mostly linux environment
>
> A large NFS server went down recently and as it rebooted, fsck ran for
> a while before the data volumes could be mounted. I noticed the
> filesystem was ext3 and asked, is journaling disabled? Why on earth is
> fsck running at all? The admin assured me this is quite normal for
> ext3 and after a few minutes, the system was brought back online.

What is normal is that ext3 will perform an *occasional* fsck - by
default, once a month or every thirty-odd mounts - to catch any
corruption that has been missed by the journaling.

Also, the fsck will replay the journal outside the kernel if possible,
so you may have witnessed the journal replay. This is done because it
allows multiple filesystems to replay the journal in parallel rather
than sequentially as doing it in-kernel would require.

Regards,
Daniel
--
If you ever reach total enlightenment while you're drinking a beer, I bet it
makes beer shoot out your nose.
        -- Jack Handy
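
To make the parallel-replay argument concrete, here is a rough back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: the device names, the per-filesystem replay times, the function names, and the rule that filesystems sharing one physical disk are handled one after another. It is not how fsck is implemented; it only compares the elapsed time of the two strategies.

```python
from collections import defaultdict


def replay_time_sequential(jobs):
    """One filesystem after another, as replay at mount time would do."""
    return sum(seconds for _disk, seconds in jobs.values())


def replay_time_parallel(jobs):
    """Disks proceed concurrently; filesystems on the same disk are serialized.

    Elapsed time is therefore set by the busiest disk.
    """
    per_disk = defaultdict(float)
    for disk, seconds in jobs.values():
        per_disk[disk] += seconds
    return max(per_disk.values(), default=0.0)


# Hypothetical layout: filesystem -> (physical disk, replay time in seconds).
jobs = {
    "/dev/sda1": ("sda", 4.0),
    "/dev/sda2": ("sda", 2.0),
    "/dev/sdb1": ("sdb", 5.0),
    "/dev/sdc1": ("sdc", 3.0),
}

print(replay_time_sequential(jobs))  # 14.0
print(replay_time_parallel(jobs))    # 6.0
```

With these made-up numbers the parallel case is bounded by the disk holding two filesystems (4 + 2 seconds), while the sequential case pays for all four in turn.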

* Re: EXT intent logging
From: Doug McNaught @ 2004-08-06 13:22 UTC
To: Daniel Pittman; +Cc: linux-kernel, Buddy Lumpkin

Daniel Pittman <daniel@rimspace.net> writes:
> What is normal is that ext3 will perform an *occasional* fsck - by
> default, once a month or every thirty-odd mounts - to catch any
> corruption that has been missed by the journaling.

And if you don't want this to happen, you can use 'tune2fs' to turn it
off, and rely entirely on journal replays.

-Doug
--
Let us cross over the river, and rest under the shade of the trees.
   --T. J. Jackson, 1863
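
The periodic check discussed above boils down to two independent triggers, a mount counter and a time interval. A minimal sketch of that policy follows. The function and parameter names are hypothetical, the 30-mount and 30-day limits are stand-ins for "once a month or every thirty-odd mounts", and treating a zero limit as "disabled" is an assumption of this sketch rather than a statement about the real tools.

```python
from datetime import datetime, timedelta


def periodic_check_due(mount_count, max_mount_count, last_check, check_interval, now):
    """Return True if a routine full check would be forced at this mount.

    In this sketch a limit of zero disables the corresponding trigger.
    """
    by_count = max_mount_count > 0 and mount_count >= max_mount_count
    by_time = check_interval > timedelta(0) and now - last_check >= check_interval
    return by_count or by_time


now = datetime(2004, 8, 6)
month = timedelta(days=30)

# Last checked 36 days ago: the time trigger fires.
print(periodic_check_due(12, 30, datetime(2004, 7, 1), month, now))   # True
# Recently checked, few mounts: nothing fires.
print(periodic_check_due(12, 30, datetime(2004, 8, 1), month, now))   # False
# Mount counter has reached its limit.
print(periodic_check_due(30, 30, datetime(2004, 8, 1), month, now))   # True
# Both triggers disabled: rely entirely on journal replay.
print(periodic_check_due(99, 0, datetime(2004, 1, 1), timedelta(0), now))  # False
```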

* RE: EXT intent logging
From: Buddy Lumpkin @ 2004-08-06 16:36 UTC
To: 'Doug McNaught', 'Daniel Pittman'; +Cc: linux-kernel

But if it doesn't do writes to the journal first, how does it identify
transactions that were "in flight" when the system went down to reverse
them? How do you catch a partial update to an inode for instance?

--Buddy

* Re: EXT intent logging
From: Bernd Eckenfels @ 2004-08-06 18:46 UTC
To: linux-kernel

In article <S268180AbUHFQey/20040806163555Z+829@vger.kernel.org> you wrote:
> But if it doesn't do writes to the journal first, how does it identify
> transactions that were "in flight" when the system went down to reverse
> them? How do you catch a partial update to an inode for instance?

The journal is about metadata. If you want data journaling, try
data=journal.

Regards,
Bernd
--
eckes privat - http://www.eckes.org/
Project Freefire - http://www.freefire.org/

* Re: EXT intent logging
From: Theodore Ts'o @ 2004-08-06 19:56 UTC
To: Buddy Lumpkin; +Cc: linux-kernel

On Thu, Aug 05, 2004 at 09:55:28PM -0700, Buddy Lumpkin wrote:
> A large NFS server went down recently and as it rebooted, fsck ran
> for a while before the data volumes could be mounted. I noticed the
> filesystem was ext3 and asked, is journaling disabled? Why on earth
> is fsck running at all? The admin assured me this is quite normal
> for ext3 and after a few minutes, the system was brought back
> online.

Fsck replays the journal for ext3 filesystems that were not cleanly
unmounted. That is, the metadata (and possibly data) blocks in the
journal are written to the correct location on disk in order to make
the filesystem consistent.

> I looked at the configuration and it turns out the system was
> mounted DATA=ORDERED. That name ordered sounded to me like it
> should do the kind of intent logging that I am accustomed to on UFS
> and VXFS. I was very surprised to read that ext3 updates the
> standard data/metadata blocks prior to updating the journal.

Incorrect. Only data blocks are forced out to disk before the changes
to the metadata blocks (which are written to the journal first) are
committed. The changes to the metadata blocks are not written to disk
(outside of the journal) until after the transaction is committed.

> To eliminate fsck on large filesystems, wouldn't you have to update
> the journal first, then update the data blocks? This way in the
> event of a crash, the last entries in the log would represent the
> last I/O operations that were "intended" and those blocks could be
> inspected for consistency.

See above. The metadata changes are written out to the journal first,
but we want to make sure that, before those changes are committed, the
data blocks pointed to by the metadata blocks are valid. If you mount
-o data=writeback, then this constraint on the data blocks is not
enforced. This still eliminates the need for a full fsck, but even
though the filesystem is consistent after the journal is replayed, the
metadata blocks may be pointing at unwritten data blocks which only
contain garbage.

> Could someone explain why there isn't an option in ext3 to only log
> metadata, but completely avoid fsck by updating the log before the
> data blocks?

"mount -o data=writeback" only logs metadata. This seems to be what
you are requesting.

"mount -o data=journal" logs metadata blocks and data blocks into the
journal first, and then after the transaction commits, the metadata
and data blocks are written to their final location on disk. The
problem with this is that all your write bandwidth is cut in half
since all block writes get written twice to disk --- once to the
journal, and once to the final location on disk.

"mount -o data=ordered" simply defers the transaction commit until the
data blocks are written to their final location on disk. If the data
writes do not make it onto the disk before the system crashes, the
transaction never commits and the metadata changes won't get replayed
on filesystem recovery.

In all cases, however, a full fsck is not needed.

The reason why it is useful to have e2fsck do the journal replay, as
opposed to letting the kernel do it when you try to mount the
filesystem, is that if you have multiple disk drives, replaying the
journal in userspace allows multiple filesystems to be recovered in
parallel, instead of one filesystem at a time. Linux's fsck is
intelligent and will spawn off multiple copies of e2fsck, one for each
filesystem, so long as they are on separate disk spindles. (Running
two copies of e2fsck on different partitions on the same disk drive is
pointless, since the two e2fsck processes simply thrash the disk heads
and get in the way of each other.)

Hope this helps,

                                        - Ted
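
The difference between the three modes can be illustrated with a toy model. The Python sketch below is not the kernel's implementation; the step names, the single-transaction model, and the worst-case ordering chosen for data=writeback (data written last) are all assumptions made for illustration. It models one transaction that makes an inode point at a freshly written data block, crashes after every possible number of steps, replays the journal only if the transaction committed, and reports which crash points leave the inode pointing at a block that still holds old contents.

```python
STALE = "stale"  # whatever the block held before (old file data or garbage)
NEW = "new"      # the data the application actually wrote


def steps_for(mode):
    """Order of on-disk events for one transaction in this toy model."""
    if mode == "journal":
        return ["log_meta", "log_data", "commit", "checkpoint"]
    if mode == "ordered":
        # Data must reach its final location before the commit record.
        return ["log_meta", "write_data", "commit", "checkpoint"]
    if mode == "writeback":
        # No ordering constraint; model the worst case where data lands last.
        return ["log_meta", "commit", "checkpoint", "write_data"]
    raise ValueError(mode)


def crash_and_recover(mode, crash_after):
    """Run the first `crash_after` steps, crash, then replay the journal."""
    disk = {"inode_points_to_block": False, "block": STALE}
    journal = {"data": False, "committed": False}
    for step in steps_for(mode)[:crash_after]:
        if step == "log_data":
            journal["data"] = True
        elif step == "write_data":
            disk["block"] = NEW
        elif step == "commit":
            journal["committed"] = True
        elif step == "checkpoint":
            # Copy journaled blocks to their final location.
            disk["inode_points_to_block"] = True
            if journal["data"]:
                disk["block"] = NEW
    # Recovery: only committed transactions are replayed.
    if journal["committed"]:
        disk["inode_points_to_block"] = True
        if journal["data"]:
            disk["block"] = NEW
    return disk


def stale_crash_points(mode):
    """Crash points after which the inode references a stale block."""
    bad = []
    for k in range(len(steps_for(mode)) + 1):
        disk = crash_and_recover(mode, k)
        if disk["inode_points_to_block"] and disk["block"] == STALE:
            bad.append(k)
    return bad


for mode in ("journal", "ordered", "writeback"):
    print(mode, stale_crash_points(mode))
# journal []
# ordered []
# writeback [2, 3]
```

In every mode the metadata ends up either fully applied or not applied at all, which is why no full fsck is needed; only the writeback ordering has crash points where the metadata is consistent but the referenced block holds old contents.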

* RE: EXT intent logging
From: Buddy Lumpkin @ 2004-08-07 4:50 UTC
To: 'Theodore Ts'o'; +Cc: linux-kernel

Thanks Ted,

This clears up a lot of bad assumptions I made while reading vague
descriptions about the different ext3 journal options in miscellaneous
places.

--Buddy

* Re: EXT intent logging
From: Thomas Zimmerman @ 2004-08-08 1:54 UTC
To: linux-kernel; +Cc: Theodore Ts'o, Buddy Lumpkin

On 06-Aug 03:56, Theodore Ts'o wrote:
> "mount -o data=journal" logs metadata blocks and data blocks into
> journal first, and then after the transaction commits, the metadata
> and data blocks are written to their final location on disk. The
> problem with this is that all your write bandwidth is cut in half
> since all block writes get written twice to disk --- once to the
> journal, and once to the final location on disk.

While you do halve your write bandwidth, there aren't any _seeks_ while
writing to the journal. For IO workloads where a sync allows you to
move on to another job, this can speed up your workload. Mail delivery
on an old laptop disk was ~2x as fast using ext3 data=journal...

Thomas
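
The trade-off Thomas describes (twice the block writes, but fewer seeks before a sync can return) can be sketched with a deliberately crude cost model. The millisecond figures, the function names, and the assumption that in-place writes need one seek per block while a journal append needs a single seek are all hypothetical. The model is not calibrated against any real disk and does not attempt to reproduce the mail-delivery figure above.

```python
def sync_latency_ms(blocks, seek_ms, transfer_ms, journaled):
    """Time until a sync can return, under the assumptions stated above."""
    if journaled:
        # One seek to the journal, then a sequential append of all blocks.
        return seek_ms + blocks * transfer_ms
    # Scattered in-place writes: pay a seek for every block.
    return blocks * (seek_ms + transfer_ms)


def total_block_writes(blocks, journaled):
    """Journaled data is written twice: to the journal, then in place."""
    return 2 * blocks if journaled else blocks


# Hypothetical numbers: 4 scattered blocks, 10 ms per seek, 1 ms per block.
print(sync_latency_ms(4, 10, 1, journaled=True))    # 14
print(sync_latency_ms(4, 10, 1, journaled=False))   # 44
print(total_block_writes(4, journaled=True))        # 8
print(total_block_writes(4, journaled=False))       # 4
```

Under these assumptions the journaled path does double the I/O overall yet lets the caller continue sooner, because the second copy to the final location happens after the sync has already returned.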

Thread overview: 8+ messages
2004-08-06  4:55 EXT intent logging Buddy Lumpkin
2004-08-06  9:57 ` Daniel Pittman
2004-08-06 13:22   ` Doug McNaught
2004-08-06 16:36     ` Buddy Lumpkin
2004-08-06 18:46       ` Bernd Eckenfels
2004-08-06 19:56 ` Theodore Ts'o
2004-08-07  4:50   ` Buddy Lumpkin
2004-08-08  1:54   ` Thomas Zimmerman