repo consistency under crashes and power failures?

* repo consistency under crashes and power failures?
@ 2013-07-15 17:48 Greg Troxel
  2013-07-15 17:51 ` Jonathan Nieder
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Greg Troxel @ 2013-07-15 17:48 UTC (permalink / raw)
  To: git

Clearly there is the possibility of creating a corrupt repository when
receiving objects and updating refs, if a crash or power failure causes
data not to get written to disk but that data is pointed to.  Journaling
mitigates this, but I'd argue that programs should function safely with
only the guarantees from POSIX.

I am curious if anyone has actual experiences to share, either

  a report of corruption after a crash (where corruption means that
  either 1) git fsck reports worse than dangling objects or 2) some ref
  did not either point to the old place or the new place)

  experiments intended to provoke corruption, like dropping power during
  pushes, or forced panics in the kernel due to timers, etc.

Alternatively, is there somewhere a first-principles analysis vs POSIX
specs (such as fsyncing object files before updating refs to point to
them, which I realize has performance negatives)?

(I have not done experiments, but have observed no corruption.)

    Thanks,
    Greg

* Re: repo consistency under crashes and power failures?
  2013-07-15 17:48 repo consistency under crashes and power failures? Greg Troxel
@ 2013-07-15 17:51 ` Jonathan Nieder
  2013-07-16  6:17 ` Johannes Sixt
  2013-07-27  3:10 ` Jeff King
  2 siblings, 0 replies; 4+ messages in thread
From: Jonathan Nieder @ 2013-07-15 17:51 UTC (permalink / raw)
  To: Greg Troxel; +Cc: git

Greg Troxel wrote:

> Alternatively, is there somewhere a first-principles analysis vs POSIX
> specs (such as fsyncing object files before updating refs to point to
> them, which I realize has performance negatives)?

You might be interested in the 'core.fsyncobjectfiles' setting.
git-config(1) has details.

Thanks and hope that helps,
Jonathan

* Re: repo consistency under crashes and power failures?
  2013-07-15 17:48 repo consistency under crashes and power failures? Greg Troxel
  2013-07-15 17:51 ` Jonathan Nieder
@ 2013-07-16  6:17 ` Johannes Sixt
  2013-07-27  3:10 ` Jeff King
  2 siblings, 0 replies; 4+ messages in thread
From: Johannes Sixt @ 2013-07-16  6:17 UTC (permalink / raw)
  To: Greg Troxel; +Cc: git

Am 7/15/2013 19:48, schrieb Greg Troxel:
> Clearly there is the possibility of creating a corrupt repository when
> receiving objects and updating refs, if a crash or power failure causes
> data not to get written to disk but that data is pointed to.  Journaling
> mitigates this, but I'd argue that programs should function safely with
> only the guarantees from POSIX.

Even under POSIX, "guarantees" and "crash/power failure" do not mesh well.
This has been under dispute recently, for example:

http://thread.gmane.org/gmane.comp.standards.posix.austin.general/7456/focus=7487

The best we can achieve with POSIX alone is "to make bad consequences less
likely".

Jonathan already mentioned the knob that allows you to trade performance
for more safety.

-- Hannes

* Re: repo consistency under crashes and power failures?
  2013-07-15 17:48 repo consistency under crashes and power failures? Greg Troxel
  2013-07-15 17:51 ` Jonathan Nieder
  2013-07-16  6:17 ` Johannes Sixt
@ 2013-07-27  3:10 ` Jeff King
  2 siblings, 0 replies; 4+ messages in thread
From: Jeff King @ 2013-07-27  3:10 UTC (permalink / raw)
  To: Greg Troxel; +Cc: git

On Mon, Jul 15, 2013 at 01:48:23PM -0400, Greg Troxel wrote:

> I am curious if anyone has actual experiences to share, either
> 
>   a report of corruption after a crash (where corruption means that
>   either 1) git fsck reports worse than dangling objects or 2) some ref
>   did not either point to the old place or the new place)
> 
>   experiments intended to provoke corruption, like dropping power during
>   pushes, or forced panics in the kernel due to timers, etc.

I have quite a bit of experience with this, as I investigate all repo
corruption that we see on github.com, and have run experiments to try to
reproduce such corruption.

Our backend git systems are ext3 with journaling and data=ordered. We
run that on top of drbd, with two redundant machines sharing the block
device. If one dies, we fail over to the spare. Writes to the block
device are not considered committed until they are written to both
machines.

Git's scheme is to write objects (both loose and when receiving packs
over the wire) via tempfile, with an atomic link-into-place after close.
We do not fsync object files by default, but we do fsync packs. However,
it shouldn't matter as long as your filesystem orders data and metadata
writes (if it doesn't, you probably want to turn on object fsyncing).
So for our data=ordered filesystems, that's fine.
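
A minimal sketch of the object-write scheme described above (tempfile,
close, then link into place), written in Python for illustration only.
The function name write_object, the flat directory layout, and the
plain SHA-1 naming are assumptions made for this example; git's real
object code hashes a header plus the content, compresses the file, and
uses a fan-out directory layout. The do_fsync flag stands in for what
core.fsyncobjectfiles trades off.

```python
import hashlib
import os
import tempfile


def write_object(objects_dir: str, payload: bytes, do_fsync: bool = False) -> str:
    """Store payload under a content-derived name via tempfile + link.

    Hypothetical helper: the name is simply the SHA-1 of the payload.
    """
    name = hashlib.sha1(payload).hexdigest()
    final_path = os.path.join(objects_dir, name)

    # The tempfile lives in the target directory so that link() never
    # has to cross a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(prefix="tmp_obj_", dir=objects_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            if do_fsync:
                # Push the data to stable storage before the name appears.
                f.flush()
                os.fsync(f.fileno())
        # The file is closed here; only now is it given its final name.
        try:
            os.link(tmp_path, final_path)
        except FileExistsError:
            # An object with the same content is already stored.
            pass
    finally:
        # The temporary name is no longer needed either way.
        os.unlink(tmp_path)
    return name
```

Without do_fsync, the sketch relies on the filesystem ordering the data
write before the metadata update, which is the data=ordered case
discussed above.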

Ref writes have a similar fsync situation to loose object files. We
write the new ref to a tempfile, close, and then rename into place. If
the data and metadata writes are out of order, one could have problems
(but again, not a problem with data=ordered).
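
The ref update follows the same pattern with rename instead of link. A
rough Python sketch, again only for illustration: update_ref and its
do_fsync flag are hypothetical names, and git's actual ref code does
more than this (locking, for instance). With do_fsync set, the sketch
also syncs the containing directory so that the rename itself reaches
disk, which is the extra step a filesystem without ordered writes
would need.

```python
import os
import tempfile


def update_ref(ref_path: str, new_value: str, do_fsync: bool = False) -> None:
    """Replace the ref file at ref_path with new_value via tempfile + rename."""
    ref_dir = os.path.dirname(ref_path) or "."
    fd, tmp_path = tempfile.mkstemp(prefix="tmp_ref_", dir=ref_dir)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(new_value + "\n")
            if do_fsync:
                f.flush()
                os.fsync(f.fileno())
        # rename(2) semantics: readers see either the old or the new file.
        os.replace(tmp_path, ref_path)
    except BaseException:
        # The rename did not happen, so the tempfile is still around.
        os.unlink(tmp_path)
        raise
    if do_fsync:
        # Sync the directory entry so the rename is durable as well.
        dir_fd = os.open(ref_dir, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

If the object files the ref points to were written without fsync, the
ordering concern from the original question still applies: the ref can
reach disk before the objects do unless the filesystem orders the
writes.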

Most of the corruption we have seen at GitHub has been one of:

  1. Buggy non-core-git implementations that do not properly use
     tempfiles to create objects (Grit used to have this problem, but it
     is now fixed).

  2. Race conditions in examining ref state that can cause refs to be
     missed when determining reachability (thus you might prune objects
     that should be left). The worst of these is fixed in the current
     "master" and will be part of git v1.8.4. There are still ways that
     we can prune too much, but they are reasonably unlikely unless you
     are pruning constantly.

We did once experience some lost objects after a server failover.  After
much experimentation, we finally found out that the machine in question
had a RAID card with bad memory, which after a power failure would drop
some writes that it had claimed to have committed (so even fsync did not
help).

So for ordered data and metadata writes, in my experience git is quite
solid against power failures and crashes. For systems without that
guarantee, you should turn on core.fsyncobjectfiles, but I suspect you
could also see some ref corruption (and possibly index corruption, too,
as it does not fsync either).

-Peff
