* repo consistency under crashes and power failures?
From: Greg Troxel @ 2013-07-15 17:48 UTC
To: git

Clearly there is the possibility of creating a corrupt repository when
receiving objects and updating refs, if a crash or power failure causes
data not to get written to disk but that data is pointed to. Journaling
mitigates this, but I'd argue that programs should function safely with
only the guarantees from POSIX.

I am curious if anyone has actual experiences to share, either

  a report of corruption after a crash (where corruption means that
  either 1) git fsck reports worse than dangling objects or 2) some ref
  did not either point to the old place or the new place)

  experiments intended to provoke corruption, like dropping power during
  pushes, or forced panics in the kernel due to timers, etc.

Alternatively, is there somewhere a first-principles analysis vs POSIX
specs (such as fsyncing object files before updating refs to point to
them, which I realize has performance negatives)?

(I have not done experiments, but have observed no corruption.)

Thanks,
Greg

* Re: repo consistency under crashes and power failures?
From: Jonathan Nieder @ 2013-07-15 17:51 UTC
To: Greg Troxel; +Cc: git

Greg Troxel wrote:

> Alternatively, is there somewhere a first-principles analysis vs POSIX
> specs (such as fsyncing object files before updating refs to point to
> them, which I realize has performance negatives)?

You might be interested in the 'core.fsyncobjectfiles' setting.
git-config(1) has details.

Thanks and hope that helps,
Jonathan

* Re: repo consistency under crashes and power failures?
From: Johannes Sixt @ 2013-07-16 6:17 UTC
To: Greg Troxel; +Cc: git

On 7/15/2013 19:48, Greg Troxel wrote:
> Clearly there is the possibility of creating a corrupt repository when
> receiving objects and updating refs, if a crash or power failure causes
> data not to get written to disk but that data is pointed to. Journaling
> mitigates this, but I'd argue that programs should function safely with
> only the guarantees from POSIX.

Even under POSIX, "guarantees" and "crash/power failure" do not mesh
well. This has been under dispute recently, for example:

http://thread.gmane.org/gmane.comp.standards.posix.austin.general/7456/focus=7487

The best we can achieve with POSIX alone is "to make bad consequences
less likely". Jonathan already mentioned the knob that allows you to
trade performance for more safety.

-- Hannes

* Re: repo consistency under crashes and power failures?
From: Jeff King @ 2013-07-27 3:10 UTC
To: Greg Troxel; +Cc: git

On Mon, Jul 15, 2013 at 01:48:23PM -0400, Greg Troxel wrote:

> I am curious if anyone has actual experiences to share, either
>
>   a report of corruption after a crash (where corruption means that
>   either 1) git fsck reports worse than dangling objects or 2) some ref
>   did not either point to the old place or the new place)
>
>   experiments intended to provoke corruption, like dropping power during
>   pushes, or forced panics in the kernel due to timers, etc.

I have quite a bit of experience with this, as I investigate all repo
corruption that we see on github.com, and have run experiments to try to
reproduce such corruption.

Our backend git systems are ext3 with journaling and data=ordered. We
run that on top of drbd, with two redundant machines sharing the block
device. If one dies, we fail over to the spare. Writes to the block
device are not considered committed until they are written to both
machines.

Git's scheme is to write objects (both loose and when receiving packs
over the wire) via tempfile, with an atomic link-into-place after close.
We do not fsync object files by default, but we do fsync packs. However,
it shouldn't matter as long as your filesystem orders data and metadata
writes (if it doesn't, you probably want to turn on object fsyncing). So
for our data=ordered filesystems, that's fine.

Ref writes have a similar fsync situation to loose object files. We
write the new ref to a tempfile, close, and then rename into place. If
the data and metadata writes are out of order, one could have problems
(but again, not a problem with data=ordered).

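The write-to-tempfile, close, then link or rename pattern described
above can be sketched roughly as follows. This is an illustration in
Python, not git's actual C code: the function name `write_atomically`
and its `do_fsync` and `replace` parameters are hypothetical, and
treating an already existing object path as an error is an assumption
made to keep the sketch short.

```python
import os
import tempfile


def write_atomically(path, data, do_fsync=False, replace=True):
    """Write `data` to `path` via a tempfile in the same directory.

    Hypothetical helper for illustration only.
    replace=True  -> rename into place (the ref-style update)
    replace=False -> hard-link into place, then drop the tempfile
                     (the object-style "link-into-place")
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The tempfile lives in the target directory so that the final
    # link/rename never crosses a filesystem boundary.
    fd, tmp = tempfile.mkstemp(prefix="tmp_", dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            if do_fsync:
                # Ask the OS to push the file contents to stable
                # storage before the name becomes visible.
                os.fsync(f.fileno())
        # The file is closed at this point.
        if replace:
            os.replace(tmp, path)  # atomically swap in the new content
        else:
            os.link(tmp, path)     # raises FileExistsError if path exists
            os.unlink(tmp)
    except BaseException:
        # Do not leave a stray tempfile behind on failure.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

With `do_fsync=False`, the sketch depends on the filesystem writing data
before the metadata that names it, as with data=ordered. Passing
`do_fsync=True` corresponds to the trade of performance for safety that
object fsyncing makes.
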
Most of the corruption we have seen at GitHub has been one of:

  1. Buggy non-core-git implementations that do not properly use
     tempfiles to create objects (Grit used to have this problem, but it
     is now fixed).

  2. Race conditions in examining ref state that can cause refs to be
     missed when determining reachability (thus you might prune objects
     that should be left). The worst of these is fixed in the current
     "master" and will be part of git v1.8.4. There are still ways that
     we can prune too much, but they are reasonably unlikely unless you
     are pruning constantly.

We did once experience some lost objects after a server failover. After
much experimentation, we finally found out that the machine in question
had a RAID card with bad memory which would drop some writes which it
claimed to have committed after a power failure (so even fsync did not
help).

So for ordered data and metadata writes, in my experience git is quite
solid against power failures and crashes. For systems without that
guarantee, you should turn on core.fsyncobjectfiles, but I suspect you
could also see some ref corruption (and possibly index corruption, too,
as it does not fsync either).

-Peff