From: Junio C Hamano <gitster@pobox.com>
To: Jeff King <peff@peff.net>
Cc: git@vger.kernel.org
Subject: Re: [PATCH 0/2] optimizing pack access on "read only" fetch repos
Date: Sat, 26 Jan 2013 22:32:42 -0800
Message-ID: <7vlibfxhit.fsf@alter.siamese.dyndns.org>
In-Reply-To: <20130126224011.GA20675@sigill.intra.peff.net> (Jeff King's message of "Sat, 26 Jan 2013 17:40:11 -0500")

Jeff King <peff@peff.net> writes:

> This is a repost from here:
>
>   http://thread.gmane.org/gmane.comp.version-control.git/211176
>
> which got no response initially. Basically the issue is that read-only
> repos (e.g., a CI server) whose workflow is something like:
>
>   git fetch $some_branch &&
>   git checkout -f $some_branch &&
>   make test
>
> will never run git-gc, and will accumulate a bunch of small packs and
> loose objects, leading to poor performance.
>
> Patch 1 runs "gc --auto" on fetch, which I think is sane to do.
>
> Patch 2 optimizes our pack dir re-scanning for fetch-pack (which, unlike
> the rest of git, should expect to be missing lots of objects, since we
> are deciding what to fetch).
>
> I think 1 is a no-brainer. If your repo is packed, patch 2 matters less,
> but it still seems like a sensible optimization to me.
>
>   [1/2]: fetch: run gc --auto after fetching
>   [2/2]: fetch-pack: avoid repeatedly re-scanning pack directory
>
> -Peff

Both make sense to me.

I also wonder if we would be helped by another "repack" mode that
coalesces small packs into a single one with minimum overhead, and
run that often from "gc --auto", so that we do not end up having to
have 50 packfiles.

When we have 2 or more small and young packs, we could:

 - iterate over idx files for these packs to enumerate the objects
   to be packed, replacing read_object_list_from_stdin() step;

 - always choose to copy the data we have in these existing packs,
   instead of doing a full prepare_pack(); and

 - use the order the objects appear in the original packs, bypassing
   compute_write_order().
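
To make the three steps above concrete, here is a rough sketch of the
planning half only, not git code.  It assumes each pack's idx has
already been read into a list of (object id, offset) pairs; the name
plan_coalesce() and that in-memory shape are made up for illustration.
Since an idx is sorted by object name, the sketch sorts by offset to
recover the order in which objects appear in the pack.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory stand-in for one pack's .idx contents:
# (object id, offset of the object inside that pack).
IdxEntry = Tuple[str, int]


def plan_coalesce(packs: Dict[str, List[IdxEntry]]) -> List[Tuple[str, str, int]]:
    """Return (object id, source pack, source offset) in output order.

    Packs are visited in the order the caller supplies them; within a
    pack, objects keep the order they have on disk.
    """
    seen = set()
    plan = []
    for pack_name, idx in packs.items():
        # The idx lists objects by name, so sort by offset to walk the
        # pack in its original on-disk order.
        for oid, offset in sorted(idx, key=lambda entry: entry[1]):
            if oid in seen:
                continue  # extra copy from another pack: leave it out
            seen.add(oid)
            plan.append((oid, pack_name, offset))
    return plan


packs = {
    "pack-1": [("aa", 40), ("bb", 12)],
    "pack-2": [("bb", 12), ("cc", 60)],
}
for oid, pack_name, offset in plan_coalesce(packs):
    print(oid, pack_name, offset)
```

The result is only a copy plan; the data itself would still be copied
out of the existing packs rather than recomputed.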

The procedure cannot be a straight byte-for-byte copy, because some
objects may appear in multiple packs, and extra copies of the same
object have to be excised from the result.  OFS_DELTA offsets need
to be adjusted for objects that appear later in the output and for
objects that were deltified against such an object that recorded its
base with OFS_DELTA format.

But other than such OFS_DELTA adjustments, it feels that such an
"only coalesce multiple packs into one" mode should be fairly quick.
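
As an illustration of the OFS_DELTA bookkeeping, here is a simplified
model, not git code.  The Entry type and relocate() are hypothetical
names.  The sketch treats each entry's size as fixed, which glosses
over the fact that the OFS_DELTA distance is stored in a
variable-length encoding, so a real implementation would also have to
cope with entries growing or shrinking when the distance changes.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Entry(NamedTuple):
    oid: str             # object id
    size: int            # bytes the entry occupies (assumed fixed here)
    base: Optional[str]  # object id of the OFS_DELTA base, or None


def relocate(entries: List[Entry],
             start: int = 12) -> Dict[str, Tuple[int, Optional[int]]]:
    """Map object id -> (new offset, new distance back to its base).

    `entries` is every input pack's contents concatenated in output
    order, duplicates included.  `start` skips the 12-byte pack header.
    """
    new_offset: Dict[str, int] = {}
    kept: List[Entry] = []
    pos = start
    for entry in entries:
        if entry.oid in new_offset:
            continue  # dropped duplicate; everything after it shifts up
        new_offset[entry.oid] = pos
        kept.append(entry)
        pos += entry.size
    result = {}
    for entry in kept:
        distance = None
        if entry.base is not None:
            # The base always precedes the delta, so this is positive.
            distance = new_offset[entry.oid] - new_offset[entry.base]
        result[entry.oid] = (new_offset[entry.oid], distance)
    return result


entries = [
    Entry("base", 100, None),   # from the first pack
    Entry("d1", 20, "base"),
    Entry("base", 100, None),   # second pack's copy, to be excised
    Entry("d2", 30, "base"),    # was 100 bytes after its base
]
for oid, (offset, distance) in relocate(entries).items():
    print(oid, offset, distance)
```

In this toy input, d2 sat 100 bytes after its base in the second pack;
once that copy of the base is dropped, d2 has to point back at the
first pack's copy, so its recorded distance changes even though its
own data does not.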
