* Ordering of directory operations maintained across system crashes in Btrfs?
From: thanumalayan mad @ 2014-02-26 2:01 UTC
To: linux-btrfs

Hi all,

Slightly complicated question.

Assume I do two directory operations in a Btrfs partition (such as an
unlink() and a rename()), one after the other, and a crash happens
after the rename(). Can Btrfs (the current version) send the second
operation to the disk first, so that after the crash, I observe the
effects of rename() but not the effects of the unlink()?

I think I am observing Btrfs re-ordering an unlink() and a rename(),
and I just want to confirm that my observation is true. Also, if Btrfs
does send directory operations to disk out of order, is there some
limitation on this? Like, is this restricted to only unlink() and
rename()?

I am looking at some (buggy) applications that use Btrfs, and this
behavior seems to affect them.

Thanks,
Thanu

* Re: Ordering of directory operations maintained across system crashes in Btrfs?
From: thanumalayan mad @ 2014-03-03 17:30 UTC
To: linux-btrfs

Any ideas about this? Guessed-up, not-entirely-sure answers would help
too.

An example application bug that would be affected by this is from
LevelDB: https://code.google.com/p/leveldb/issues/detail?id=189

Thanks,
Thanu

* Re: Ordering of directory operations maintained across system crashes in Btrfs?
From: Chris Mason @ 2014-03-03 17:43 UTC
To: madthanu, linux-btrfs

On 02/25/2014 09:01 PM, thanumalayan mad wrote:
> Hi all,
>
> Slightly complicated question.
>
> Assume I do two directory operations in a Btrfs partition (such as an
> unlink() and a rename()), one after the other, and a crash happens
> after the rename(). Can Btrfs (the current version) send the second
> operation to the disk first, so that after the crash, I observe the
> effects of rename() but not the effects of the unlink()?
>
> I think I am observing Btrfs re-ordering an unlink() and a rename(),
> and I just want to confirm that my observation is true. Also, if Btrfs
> does send directory operations to disk out of order, is there some
> limitation on this? Like, is this restricted to only unlink() and
> rename()?
>
> I am looking at some (buggy) applications that use Btrfs, and this
> behavior seems to affect them.

There isn't a single answer for this one.

You might have

Thread A:

    unlink(foo);
    rename(somefile, somefile2);
    <crash>

This should always have the unlink happen before or in the same
transaction as the rename.

Thread A:

    unlink(dirA/foo);
    rename(dirB/somefile, dirB/somefile2);

Here you're at the mercy of what is happening in dirB. If someone
fsyncs that directory, the rename may hit the disk before the unlink.

Thread A:

    unlink(foo);
    rename(somefile, somefile2);
    fsync(somefile);

This one is even fuzzier. Backrefs allow us to do some file fsyncs
without touching the directory, making it possible the unlink will hit
disk after the fsync.

-chris

* Re: Ordering of directory operations maintained across system crashes in Btrfs?
From: thanumalayan mad @ 2014-03-03 17:56 UTC
To: Chris Mason; +Cc: linux-btrfs

Chris,

Great, thanks. Any guesses whether other filesystems (disk-based) do
things similar to the last two examples you pointed out? Saying "we
think 3 normal filesystems reorder stuff" seems to motivate
application developers to fix bugs ...

Also, just for more information, the sequence we observed was,

Thread A:

    unlink(foo)
    rename(somefile X, somefile Y)
    fsync(somefile Z)

The source and destination of the renamed file are unrelated to the
fsync. But the rename happens in the fsync()'s transaction, while
unlink() is delayed. I guess this has something to do with backrefs
too.

Thanks,
Thanu

* Re: Ordering of directory operations maintained across system crashes in Btrfs?
From: Goswin von Brederlow @ 2014-03-13 10:01 UTC
To: thanumalayan mad; +Cc: Chris Mason, linux-btrfs

On Mon, Mar 03, 2014 at 11:56:49AM -0600, thanumalayan mad wrote:
> Chris,
>
> Great, thanks. Any guesses whether other filesystems (disk-based) do
> things similar to the last two examples you pointed out? Saying "we
> think 3 normal filesystems reorder stuff" seems to motivate
> application developers to fix bugs ...
>
> Also, just for more information, the sequence we observed was,
>
> Thread A:
>
>     unlink(foo)
>     rename(somefile X, somefile Y)
>     fsync(somefile Z)
>
> The source and destination of the renamed file are unrelated to the
> fsync. But the rename happens in the fsync()'s transaction, while
> unlink() is delayed. I guess this has something to do with backrefs
> too.
>
> Thanks,
> Thanu
>
> On Mon, Mar 3, 2014 at 11:43 AM, Chris Mason <clm@fb.com> wrote:
> > There isn't a single answer for this one.
> >
> > You might have
> >
> > Thread A:
> >
> >     unlink(foo);
> >     rename(somefile, somefile2);
> >     <crash>
> >
> > This should always have the unlink happen before or in the same
> > transaction as the rename.
> >
> > Thread A:
> >
> >     unlink(dirA/foo);
> >     rename(dirB/somefile, dirB/somefile2);
> >
> > Here you're at the mercy of what is happening in dirB. If someone
> > fsyncs that directory, the rename may hit the disk before the unlink.
> >
> > Thread A:
> >
> >     unlink(foo);
> >     rename(somefile, somefile2);
> >     fsync(somefile);
> >
> > This one is even fuzzier. Backrefs allow us to do some file fsyncs
> > without touching the directory, making it possible the unlink will
> > hit disk after the fsync.
> >
> > -chris

As I understand it, POSIX only guarantees that the in-core data is
updated by the syscalls in order. On a crash anything can happen. If
the application needs something to be committed to disk then it needs
to fsync(). Specifically, it needs to fsync() the changed files AND
directories. From man fsync:

    Calling fsync() does not necessarily ensure that the entry in the
    directory containing the file has also reached disk. For that an
    explicit fsync() on a file descriptor for the directory is also
    needed.

So the fsync(somefile) above doesn't necessarily force the rename to
disk.

My experience with fuse tells me that at least fuse handles operations
in parallel and only blocks a later operation if it is affected by an
earlier operation. An unlink in one directory can (and will) run in
parallel to a rename in another directory. Then, depending on how
threads get scheduled, the rename can complete before the unlink.

My conclusion is that you need to fsync() the directory to ensure the
metadata update has made it to the disk if you require that. Otherwise
you have to be able to cope with (meta)data loss on crash.
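
A minimal sketch of that conclusion, assuming a Linux/POSIX system
where a directory can be opened read-only and its descriptor passed to
fsync(). The helper name unlink_then_rename_durably() and its
parameters are hypothetical, not something from this thread. It syncs
the directory after each step, so the rename is not even issued until
the unlink has been flushed; how a particular filesystem honors a
directory fsync() is something to verify rather than take from this
example.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>   /* renameat() */
#include <unistd.h>  /* unlinkat(), fsync(), close() */

/*
 * Hypothetical helper: inside directory `dir_path`, unlink `victim`,
 * then rename `from` to `to`, calling fsync() on the directory after
 * each step so the rename is only issued once the unlink was flushed.
 * All three names are relative to `dir_path`.
 * Returns 0 on success, -1 on failure with errno from the failing call.
 */
int unlink_then_rename_durably(const char *dir_path, const char *victim,
                               const char *from, const char *to)
{
    int rc = -1;
    int saved_errno;
    int dfd = open(dir_path, O_RDONLY);   /* descriptor for the directory */

    if (dfd < 0)
        return -1;

    if (unlinkat(dfd, victim, 0) != 0)
        goto out;
    if (fsync(dfd) != 0)                  /* flush the unlink first */
        goto out;
    if (renameat(dfd, from, dfd, to) != 0)
        goto out;
    if (fsync(dfd) != 0)                  /* then flush the rename */
        goto out;
    rc = 0;

out:
    saved_errno = errno;                  /* close() may overwrite errno */
    close(dfd);
    errno = saved_errno;
    return rc;
}
```

A call such as unlink_then_rename_durably("dbdir", "foo", "somefile",
"somefile2") would cover the single-directory case; operations that
span two directories would need an fsync() on each directory involved.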

Note: https://code.google.com/p/leveldb/issues/detail?id=189 talks a
lot about journaling and says that any journaling filesystem should
preserve the order. I think that is rather pointless for two reasons:

1) The journal gets replayed after a crash, so the order in which the
   two journal entries are written doesn't matter. They both make it
   to disk. You can't see one without the other. This is assuming you
   fsync()ed the dirs to force the metadata change into the journal in
   the first place.

2) btrfs afaik doesn't have any journal since COW already guarantees
   atomic updates and crash protection.

Overall I also think the fear of fsync() is overrated for this issue.
This would only happen on program start or whenever you open a
database. Not something that happens every second.

Regards,
Goswin