From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: "Duy Nguyen" <pclouds@gmail.com>,
"John Keeping" <john@keeping.me.uk>,
"Дилян Палаузов" <dilyan.palauzov@aegee.org>,
"Git Mailing List" <git@vger.kernel.org>
Subject: Re: git pull & git gc
Date: Thu, 19 Mar 2015 00:09:57 -0400
Message-ID: <20150319040957.GA29437@peff.net>
In-Reply-To: <CAPc5daWmppS_PrvMurEUfvZ2c_bhMnLb-zmck0wzFpgJ4maxZw@mail.gmail.com>
On Wed, Mar 18, 2015 at 07:27:46PM -0700, Junio C Hamano wrote:
> > I guess leaving a bunch of loose objects around longer than necessary
> > isn't the end of the world. It wastes space, but it does not actively
> > make the rest of git slower (whereas having a large number of packs does
> > impact performance). So you could probably make "recent enough" be "T <
> > now - gc.pruneExpire / 4" or something. At most we would try to gc 4
> > times before dropping unreachable objects, and for the default period,
> > that's only once every couple days.
>
> We could simply prune unreachables more aggressively, and it would
> solve this issue at the root cause, no?
Yes, but not too aggressively. You mentioned object archaeology, but my
main interest is avoiding corruption. The mtime check is the thing that
prevents us from pruning objects being used for an operation-in-progress
that has not yet updated a ref. For some long-running operations, like
adding files to a commit, we take into account references like a blob
being mentioned in the index. But I do not know offhand if there are
other long-running operations that would run into problems if we
shortened the expiration time drastically. Anything building a
temporary index is potentially problematic.
But if we assume that operations like that tend to create and reference
their objects within a reasonable time period (say, seconds to minutes)
then the current default of 2 weeks is absurd for this purpose. For
raciness within a single operation, a few seconds is probably enough
(e.g., we may write out a commit object and then update the ref a few
milliseconds later).
The potential for problems is exacerbated by the fact that object `X`
may exist in the filesystem with an old mtime, and then a new operation
wants to reference it. That's made somewhat better by 33d4221
(write_sha1_file: freshen existing objects, 2014-10-15), as before we
could silently turn a file write into a noop. But it's still racy to do:

  git cat-file -e $commit
  git update-ref refs/heads/foo $commit

as we do not update the mtime for a read-only operation like cat-file
(and even if we did, it's still somewhat racy as prune does not
atomically check the mtime and remove the file).
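As a rough illustration of the mtime check discussed above (not git's
actual implementation, which is in C), the following shell sketch locates
a loose object's file and tests whether it is older than a cutoff. The
helper names loose_object_path and older_than_minutes are hypothetical,
the object is assumed to be loose rather than packed, and "find -mmin" is
assumed to be available (GNU and BSD find have it, POSIX does not).

```shell
# Hypothetical helper: print the loose-object file path for a full 40-hex
# object id. The first two hex digits name the directory, the rest the file.
loose_object_path() {
    objdir=$1
    oid=$2
    dir=$(printf '%s' "$oid" | cut -c1-2)
    file=$(printf '%s' "$oid" | cut -c3-)
    printf '%s/%s/%s\n' "$objdir" "$dir" "$file"
}

# Hypothetical helper: succeed if the file's mtime is more than $2 minutes
# in the past, which is roughly the question an expiry check asks.
older_than_minutes() {
    [ -n "$(find "$1" -mmin "+$2" 2>/dev/null)" ]
}

# Example (not run here), assuming $commit names a loose object:
#   path=$(loose_object_path .git/objects "$(git rev-parse "$commit")")
#   older_than_minutes "$path" 60 && echo "old enough to be a prune candidate"
```

Checking the age this way and then updating a ref is exactly the kind of
non-atomic sequence described above, so the sketch shows what is being
compared rather than a safe way to guard against pruning.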
So I think there's definitely some possible danger with dropping the
default prune expiration time.
For a long time GitHub ran with it as 1.hour.ago. We definitely saw some
oddities and corruption over the years that were apparently caused by
over-aggressive pruning and/or raciness. I've fixed a number of bugs,
and things did get better as a result. But I could not say whether all
such problems are gone. These days we do our regular repacks with
"--keep-unreachable" and almost never prune anything.
It's also not clear whether GitHub represents anything close to "normal"
use. We have a much smaller array of operations that we perform (most
objects are either from a push, or from a test-merge between a topic
branch and HEAD). But we also have busy repos that are frequently doing
gc in the background (especially because we share object storage, so
activity on another fork can trigger a gc job that affects a whole
repository network). On workstations, I'd guess most git-gc jobs run
during a fairly quiescent period.
All of which is to say that I don't really know the answer, and there
may be dragons. I'd imagine that dropping the default expiration time
from 2 weeks to 1 day would probably be fine. A good way to experiment
would be for some brave souls to set gc.pruneexpire themselves, run with
it for a few weeks or months, and see if anything goes wrong.
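For anyone inclined to try that experiment, a minimal sketch might look
like the following. gc.pruneExpire, "git count-objects -v" and "git fsck"
are existing git features; the function name try_prune_expire and the
value 1.day.ago are only illustrative choices.

```shell
# Hypothetical helper: set a shorter prune expiration for one repository
# and print the value git now reports. $1 is the repo path, $2 the expiry.
try_prune_expire() {
    git -C "$1" config gc.pruneExpire "$2" &&
    git -C "$1" config --get gc.pruneExpire
}

# Example (not run here):
#   try_prune_expire /path/to/repo 1.day.ago
# Then, every so often over the following weeks:
#   git -C /path/to/repo count-objects -v   # loose object count and size
#   git -C /path/to/repo fsck               # report missing or corrupt objects
```

A clean fsck does not prove the shorter expiry is safe, since the races
in question are timing-dependent, but a missing object would be a useful
data point.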
-Peff