* Potential data consistency issue with ASYNC_COMMIT feature
@ 2009-12-11 6:45 Oleg Drokin
2009-12-11 7:14 ` Oleg Drokin
0 siblings, 1 reply; 5+ messages in thread
From: Oleg Drokin @ 2009-12-11 6:45 UTC (permalink / raw)
To: linux-ext4; +Cc: Alex Zhuravlev, Andreas Dilger
Hello!
I think the ext4 ASYNC_COMMIT feature is potentially pretty unsafe
when the write-back cache is enabled on the device.
Since no barriers are ever issued with this feature, even if
barriers are enabled, we might end up in a situation
where we write the journal blocks, then the commit block, and they
land in the device write-back cache; after that the actual metadata
blocks are allowed to go to disk, and eventually they will.

In the end the device might decide to reorder some of the
actual metadata updates in front of the journal updates, and
if those metadata updates hit the disk and a power or other
failure occurs after that, while the journal blocks are still
only in the cache, we have an inconsistent filesystem as a result.

I do not see an easy way to remedy the problem in this case
other than to insert an empty barrier after the commit block
and wait for its completion, but I think that would negate
the entire gain from this feature. I wish we actually had
real ordered writes implemented, not just a barrier/FUA
sent to the device before every ordered buffer.
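
The reordering hazard and the flush remedy described above can be
illustrated with a small Python sketch. This is a toy model, not kernel
code: the block names, the FLUSH marker, and the functions
possible_crash_states and unsafe_states are all hypothetical. It assumes
that a volatile write-back cache may persist any subset of the writes
issued since the last completed flush.

```python
from itertools import combinations

FLUSH = "FLUSH"  # hypothetical marker for a completed cache flush
JOURNAL = frozenset({"journal-1", "journal-2", "commit"})


def possible_crash_states(ops):
    """Return every set of blocks that could be on stable media at a crash.

    Assumption: a flush makes all earlier writes durable before any later
    write is issued; writes issued since the last flush may each be durable
    or lost, independently of submission order.
    """
    states = set()
    for cut in range(len(ops) + 1):  # the crash happens after `cut` operations
        durable, pending = set(), []
        for op in ops[:cut]:
            if op == FLUSH:
                durable.update(pending)
                pending = []
            else:
                pending.append(op)
        for size in range(len(pending) + 1):
            for survivors in combinations(pending, size):
                states.add(frozenset(durable | set(survivors)))
    return states


def unsafe_states(ops):
    """Crash states where metadata is on disk but its journal copy is not."""
    return {
        state
        for state in possible_crash_states(ops)
        if any(block.startswith("meta") for block in state)
        and not JOURNAL <= state
    }


no_flush = ["journal-1", "journal-2", "commit", "meta-inode"]
with_flush = ["journal-1", "journal-2", "commit", FLUSH, "meta-inode"]

print(len(unsafe_states(no_flush)))    # 7
print(len(unsafe_states(with_flush)))  # 0
```

In this model, the run without a flush admits seven crash states in which
the metadata block is durable but the journal transaction is incomplete,
while a single flush after the commit block removes all of them.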
Am I missing something?
Thanks.
Bye,
Oleg
* Re: Potential data consistency issue with ASYNC_COMMIT feature
From: Oleg Drokin @ 2009-12-11 7:14 UTC (permalink / raw)
To: linux-ext4; +Cc: Alex Zhuravlev, Andreas Dilger
Whoops, nevermind, it seems blkdev_issue_flush after commit does the barrier, I see it now.
It's just rhel5 kernel that is affected.
* Re: Potential data consistency issue with ASYNC_COMMIT feature
From: Eric Sandeen @ 2009-12-11 16:01 UTC (permalink / raw)
To: Oleg Drokin; +Cc: linux-ext4, Alex Zhuravlev, Andreas Dilger
Oleg Drokin wrote:
> Whoops, nevermind, it seems blkdev_issue_flush after commit does the barrier, I see it now.
> It's just rhel5 kernel that is affected.

rhel5 necessarily lags upstream, but updates will come.

We still need to do a lot of actual real-world testing of lost caches
in -all- ext4 journaling modes, I think.

-Eric
* Re: Potential data consistency issue with ASYNC_COMMIT feature
From: tytso @ 2009-12-11 20:52 UTC (permalink / raw)
To: Oleg Drokin; +Cc: linux-ext4, Alex Zhuravlev, Andreas Dilger
On Fri, Dec 11, 2009 at 02:14:01AM -0500, Oleg Drokin wrote:
> Whoops, nevermind, it seems blkdev_issue_flush after commit does the
> barrier, I see it now. It's just rhel5 kernel that is affected.

Yeah, the original ASYNC_COMMIT was totally unsafe, for the reason you
suggested; I was able to trivially induce fs corruption after a crash.
However, with the fixed async_commit code, in combination with journal
checksums, we can reduce the number of barrier ops per commit from two
to one, which increases the fs_mark by 50% (i.e., from 30 ops/sec to
45 ops/sec on a laptop hard drive).

However, journal checksums failed horribly when we tried to enable
them by default during the last merge window, because of bugs in ext4
where we were modifying certain metadata blocks (in particular
superblock and xattr's) without journalling them. (Note to self; we
need to back port those fixes to ext3; the lack of journalling in
xattr in particular could mean that in some cases we could lose some
updates that could affect SELINUX after a crash.)

I think we fixed them all for 2.6.33, but we haven't had time to do
the necessary testing before we enable journal checksums by default,
and after additional testing, I'd like to enable async commit by
default as well, since it means we'll beat the pants off of all of the
other journalling file systems (XFS and JFS are doing two barrier ops
per commit, if I recall correctly; not sure about btrfs) at least on
that particular benchmark. Unfortunately, we probably won't be able
to do that for 2.6.33; hopefully 2.6.34.

- Ted
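
The idea of letting a journal checksum stand in for one of the two
barriers can be sketched in Python as follows. This is only an
illustration under stated assumptions, not the jbd2 on-disk format:
zlib.crc32 is a stand-in for whatever checksum the journal really uses,
and make_commit_record and transaction_is_replayable are hypothetical
names. The point is that if the commit record carries a checksum over
the transaction's journal blocks, recovery can detect a commit whose
journal blocks never reached the disk, so the commit block no longer
needs a barrier separating it from those blocks.

```python
import zlib


def checksum_blocks(blocks):
    """Chain a CRC32 over all journal blocks of one transaction."""
    crc = 0
    for block in blocks:
        crc = zlib.crc32(block, crc)
    return crc


def make_commit_record(blocks):
    """Build a (hypothetical) commit record covering the given blocks."""
    return {"nr_blocks": len(blocks), "crc32": checksum_blocks(blocks)}


def transaction_is_replayable(blocks_on_disk, commit_record):
    """Recovery-side check: replay only if the checksum matches."""
    if commit_record is None:
        return False  # commit block never reached the disk
    if len(blocks_on_disk) != commit_record["nr_blocks"]:
        return False
    return checksum_blocks(blocks_on_disk) == commit_record["crc32"]


blocks = [b"inode table v2", b"block bitmap v2"]
record = make_commit_record(blocks)

# Commit block and journal blocks all made it: safe to replay.
print(transaction_is_replayable(blocks, record))

# Commit block made it, but one journal block still holds stale contents
# because the device reordered the writes: the transaction is discarded.
torn = [b"inode table v2", b"block bitmap v1"]
print(transaction_is_replayable(torn, record))
```

With this check in place, the only ordering that still has to be
enforced in the sketch is that the whole transaction is durable before
the in-place metadata writes start, which is one flush per commit
rather than two.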
* Re: Potential data consistency issue with ASYNC_COMMIT feature
From: Jan Kara @ 2009-12-16 20:33 UTC (permalink / raw)
To: tytso; +Cc: Oleg Drokin, linux-ext4, Alex Zhuravlev, Andreas Dilger
> However, journal checksums failed horribly when we tried to enable
> them by default during the last merge window, because of bugs in ext4
> where we were modifying certain metadata blocks (in particular
> superblock and xattr's) without journalling them. (Note to self; we
> need to back port those fixes to ext3; the lack of journalling in
> xattr in particular could mean that in some cases we could lose some
> updates that could affect SELINUX after a crash.)

Already done by Eric and merged ;):
d965736b8cb42ae51ba9c3f13488035a98d025c6,
b918397542388de75bd86c32fbfa820e5d629fa9.

Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs