From: Jeremy Maitin-Shepard <jbms@cmu.edu>
To: Junio C Hamano <gitster@pobox.com>
Cc: Nicolas Pitre <nico@cam.org>,
	Brandon Casey <casey@nrlssc.navy.mil>,
	Geert Bosch <bosch@adacore.com>, Jeff King <peff@peff.net>,
	git@vger.kernel.org
Subject: Re: git gc & deleted branches
Date: Fri, 09 May 2008 20:07:44 -0400
Message-ID: <877ie3yqb3.fsf@jeremyms.com>
In-Reply-To: <7vwsm39kft.fsf@gitster.siamese.dyndns.org> (Junio C. Hamano's message of "Fri, 09 May 2008 15:33:41 -0700")

Junio C Hamano <gitster@pobox.com> writes:

> Nicolas Pitre <nico@cam.org> writes:
>> On Fri, 9 May 2008, Brandon Casey wrote:
>> 
>>> Unreferenced objects are sometimes used by other repositories which have
>>> this repository listed as an alternate. So it may not be a good idea to
>>> make the unreferenced objects inaccessible.
>> 
>> Nah.  If this is really the case then you shouldn't be running gc at all 
>> in the first place.

> True.

> I think the true motivation behind --keep-unreachable is not about the
> shared object store (aka "alternates") but about races between gc and
> push (or fetch).  Before push (or fetch) finishes and updates refs, the
> new objects they create would be dangling _and_ the objects these dangling
> objects refer to may be packed but unreferenced.  Repacking unreferenced
> packed objects was a way to avoid losing them.

I feel that the current approach to keeping track of which objects are
still needed is messy, not well defined or based on solid principles,
and prone to errors and to losing objects.

Things like git clone --shared can only really be used in extremely
specialized setups, or if pruning of unreferenced objects is completely
disabled in the source repository, or if specialized garbage collection
scripts are used that take into account the references of the "child"
repository.  It is my impression that even repo.or.cz, while it has
some safeguards, does not handle garbage collection completely safely.
It would probably be very useful to have examples of such scripts in
contrib.

I think that ultimately, some general purpose and reliable solution
needs to be found to handle the cases of (1) a repository having its
objects referenced by another via info/alternates; (2) a repository with
multiple working directories (presumably, if you try to switch HEAD in
one working directory to the branch checked out in another, this should
warn or error out unless given a force option, or detach HEAD and
warn).  It seems, btw, that a third type of clone, one which merely
symlinks the objects directory, would also be useful, once there is a
solution to the robustness issue.  This would be a case (3) that needs
to be handled as well.

It seems clear that ultimately, to handle these three cases, every
repository needs to know about every other repository, probably via a
symlink to each other repository's .git directory.  Git gc would then
also examine any refs in those directories, making sure to avoid
circular references that might result from following the symlinks.  It
should
also probably error out if it finds a symlink that doesn't point to a
valid git repository, because such a symlink either refers to a
now-deleted repository for which the symlink needs to be cleaned up, or
it refers to a repository that was moved and therefore the symlink needs
to be updated.  Simply ignoring invalid symlinks could result in pruning
objects that need to be kept for repositories that have moved.
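
As a rough sketch of the traversal described above (not anything Git
implements): assume each .git directory holds a hypothetical
"linked-repos" subdirectory of symlinks to the other repositories' .git
directories.  The directory name, the function names, and the validity
check are all assumptions made for illustration.  The walk tracks
resolved paths so circular links terminate, and it raises an error
instead of skipping a symlink that does not point to a repository.

```python
import os

LINK_DIR = "linked-repos"  # hypothetical directory name, not a real Git feature


class InvalidLinkError(Exception):
    """A symlink under LINK_DIR does not point at a Git directory."""


def looks_like_git_dir(path):
    # Rough validity check: a Git directory has HEAD, objects/ and refs/.
    return (os.path.isfile(os.path.join(path, "HEAD"))
            and os.path.isdir(os.path.join(path, "objects"))
            and os.path.isdir(os.path.join(path, "refs")))


def collect_linked_git_dirs(git_dir):
    """Return the set of Git directories reachable from git_dir via symlinks."""
    start = os.path.realpath(git_dir)
    seen = {start}          # resolved paths already visited (breaks cycles)
    pending = [start]
    while pending:
        current = pending.pop()
        link_dir = os.path.join(current, LINK_DIR)
        if not os.path.isdir(link_dir):
            continue
        for name in sorted(os.listdir(link_dir)):
            link = os.path.join(link_dir, name)
            target = os.path.realpath(link)
            if not looks_like_git_dir(target):
                # Deleted or moved repository: refuse to continue rather
                # than risk pruning objects it may still need.
                raise InvalidLinkError(
                    f"{link} does not point to a Git repository")
            if target not in seen:
                seen.add(target)
                pending.append(target)
    return seen
```

A gc run would then treat the refs of every directory in the returned
set as additional roots before pruning anything.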

It is extremely cumbersome to have to worry about whether there are
other concurrent accesses to the repository when running e.g. git gc.
For servers, you may never be able to guarantee that nothing else is
accessing the repository concurrently.  Here is a possible solution:

Each git process creates a log file of the references that it has
created.  The log file should be named in some way with e.g. the process
id and start time of the process, and simply consist of a list of
20-byte sha1 hashes to be considered additional in-use references for
the purpose of garbage collection.  The log file would be cleaned up
when the process exits, and would also be deleted by any instance of git
gc that notices a stale log file that doesn't correspond to a running
process.  To handle shell scripts that need to deal with git hash-object
directly, git hash-object could perhaps be passed a file descriptor or
filename of a log file to use instead of creating one.  Maybe the log
file format could be more complicated, and also support paths to
e.g. alternate index files to also consider for references.  Things
would need to be done so that race conditions do not occur, but I think
something like this would work.
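
To make the idea concrete, here is a minimal sketch, assuming a
hypothetical "gc-inuse" directory under .git and log files named
"<pid>-<start time>".  None of these names exist in Git.  Liveness is
checked with a POSIX-only signal-0 probe on the pid alone, so pid reuse
is not detected, and the sketch does nothing about the race conditions
mentioned above.

```python
import os
import time

LOG_DIR = "gc-inuse"  # hypothetical directory under .git, not a real Git feature
SHA1_LEN = 20         # raw SHA-1 digest size in bytes


def log_path(git_dir, pid=None, start=None):
    """Build the log file name from a process id and start time."""
    pid = os.getpid() if pid is None else pid
    start = int(time.time()) if start is None else start
    return os.path.join(git_dir, LOG_DIR, f"{pid}-{start}")


def record_in_use(path, sha1_hex):
    """Append one 20-byte SHA-1 to the given log file."""
    raw = bytes.fromhex(sha1_hex)
    if len(raw) != SHA1_LEN:
        raise ValueError("expected a 40-character hex SHA-1")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "ab") as f:
        f.write(raw)


def pid_is_running(pid):
    # POSIX-only probe: signal 0 checks for existence without signalling.
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def collect_in_use(git_dir):
    """Return hashes from logs of live processes; delete stale logs."""
    in_use = set()
    log_dir = os.path.join(git_dir, LOG_DIR)
    if not os.path.isdir(log_dir):
        return in_use
    for name in os.listdir(log_dir):
        path = os.path.join(log_dir, name)
        pid = int(name.split("-", 1)[0])
        if not pid_is_running(pid):
            os.remove(path)  # stale log left behind by a dead process
            continue
        with open(path, "rb") as f:
            data = f.read()
        # A truncated trailing record (partial write) is ignored.
        for i in range(0, len(data) - SHA1_LEN + 1, SHA1_LEN):
            in_use.add(data[i:i + SHA1_LEN].hex())
    return in_use
```

A git process would call record_in_use() for each object it writes and
remove its own log on exit; gc would add the result of collect_in_use()
to its set of roots before pruning.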

-- 
Jeremy Maitin-Shepard

