git.vger.kernel.org archive mirror
From: Derrick Stolee <stolee@gmail.com>
To: Junio C Hamano <gitster@pobox.com>,
	git@vger.kernel.org, "avarab@gmail.com" <avarab@gmail.com>
Subject: ds/multi-pack-index (was Re: What's cooking in git.git (Jul 2018, #02; Wed, 18))
Date: Fri, 20 Jul 2018 09:42:13 -0400
Message-ID: <c1a697ed-a060-1901-073f-7c8d5d5d0f10@gmail.com>
In-Reply-To: <xmqqtvowi4l3.fsf@gitster-ct.c.googlers.com>

On 7/18/2018 6:03 PM, Junio C Hamano wrote:
> * ds/multi-pack-index (2018-07-12) 23 commits
>   - midx: clear midx on repack
>   - packfile: skip loading index if in multi-pack-index
>   - midx: prevent duplicate packfile loads
>   - midx: use midx in approximate_object_count
>   - midx: use existing midx when writing new one
>   - midx: use midx in abbreviation calculations
>   - midx: read objects from multi-pack-index
>   - config: create core.multiPackIndex setting
>   - midx: write object offsets
>   - midx: write object id fanout chunk
>   - midx: write object ids in a chunk
>   - midx: sort and deduplicate objects from packfiles
>   - midx: read pack names into array
>   - multi-pack-index: write pack names in chunk
>   - multi-pack-index: read packfile list
>   - packfile: generalize pack directory list
>   - t5319: expand test data
>   - multi-pack-index: load into memory
>   - midx: write header information to lockfile
>   - multi-pack-index: add 'write' verb
>   - multi-pack-index: add builtin
>   - multi-pack-index: add format details
>   - multi-pack-index: add design document
>
>   When there are too many packfiles in a repository (which is not
>   recommended), looking up an object in these would require
>   consulting many pack .idx files; a new mechanism to have a single
>   file that consolidates all of these .idx files is introduced.
>
>   What's the doneness of this one?  I vaguely recall that there was
>   an objection against the concept as a whole (i.e. there is a way
>   with less damage to gain the same object-abbrev performance); has
>   it (and if anything else, they) been resolved in satisfactory
>   fashion?

I believe you're talking about Ævar's patch series [1] on unconditional
abbreviation lengths. His series gets similar speedups by eliminating
the abbreviation computation entirely in favor of a relative increase in
abbreviation length that is very likely to avoid collisions. While
abbreviation speedups are the most dramatic measurable improvement from
the multi-pack-index feature, they are not its only important benefit.
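
To illustrate why a single consolidated index helps abbreviation, here
is a minimal Python sketch (not Git's actual code). It assumes object
IDs are hex strings held in one sorted, duplicate-free list; the names
common_prefix_len and shortest_unique_abbrev are hypothetical. In a
sorted list, only the two neighbors of an ID can share its longest
prefix, so one binary search is enough. With many separate pack
indexes, the same check would have to be repeated per index and the
longest result kept.

```python
from bisect import bisect_left


def common_prefix_len(a, b):
    """Return how many leading characters a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def shortest_unique_abbrev(sorted_oids, oid, min_len=4):
    """Return the shortest prefix of oid (at least min_len characters)
    that no other ID in sorted_oids shares.

    Assumes sorted_oids is sorted and contains no duplicates.
    """
    pos = bisect_left(sorted_oids, oid)
    length = min_len

    # Neighbor sorting just before oid.
    if pos > 0:
        shared = common_prefix_len(sorted_oids[pos - 1], oid)
        length = max(length, shared + 1)

    # Neighbor sorting just after oid; skip oid itself if it is present.
    present = pos < len(sorted_oids) and sorted_oids[pos] == oid
    nxt = pos + 1 if present else pos
    if nxt < len(sorted_oids):
        shared = common_prefix_len(sorted_oids[nxt], oid)
        length = max(length, shared + 1)

    return oid[:length]
```
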

Lookup speeds improve in a multi-pack environment. While only the very
largest repos have trouble repacking into a single pack, there are many
scenarios where users disable auto-gc and do not repack frequently.
On-premises build machines are the ones I know best: these machines run
24/7 to perform incremental fetches against
a remote and kick off a build. Admins frequently turn off GC so the 
build times are not impacted. Eventually, their performance does degrade 
due to the number of packfiles. The answer we give to them is to set up 
scheduled maintenance to repack. These users don't need the space 
savings of a repack, but just need consistent performance and high 
up-time. The multi-pack-index could assist here (as long as we set up 
auto-computing the multi-pack-index after a fetch).
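
The lookup difference can be sketched roughly as follows. This is an
illustrative Python model under stated assumptions, not the on-disk
format or Git's implementation: each pack index is modeled as a sorted
list of hex object IDs, the function names are hypothetical, and
duplicates are resolved by keeping the first pack seen (the real
feature may choose differently). Without a consolidated index, a
lookup may binary-search every pack index in turn; with one, it does a
single binary search.

```python
from bisect import bisect_left


def find_in_packs(pack_indexes, oid):
    """Search each per-pack sorted index in turn.

    Returns (pack_no, position) for the first pack containing oid,
    or None. Worst case: one binary search per pack.
    """
    for pack_no, oids in enumerate(pack_indexes):
        pos = bisect_left(oids, oid)
        if pos < len(oids) and oids[pos] == oid:
            return pack_no, pos
    return None


def build_multi_index(pack_indexes):
    """Merge per-pack indexes into one sorted, de-duplicated index.

    Returns two parallel lists: object IDs and the pack each lives in.
    Assumption: for duplicates, the first pack seen wins.
    """
    first_pack = {}
    for pack_no, oids in enumerate(pack_indexes):
        for oid in oids:
            first_pack.setdefault(oid, pack_no)
    oids = sorted(first_pack)
    return oids, [first_pack[o] for o in oids]


def find_in_multi_index(multi_index, oid):
    """Single binary search; returns the pack number or None."""
    oids, packs = multi_index
    pos = bisect_left(oids, oid)
    if pos < len(oids) and oids[pos] == oid:
        return packs[pos]
    return None
```

The merged index has to be rebuilt whenever the set of packfiles
changes, which is why recomputing it after a fetch matters in the
build-machine scenario.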

That's the best I can do to sell the feature as it stands now (plus the 
'fsck' integration that would follow after this series is accepted).

I have mentioned the potential for the multi-pack-index to do the following:

* Store metadata about the packfiles, possibly replacing the .keep and 
.promisor files, and allowing other extensions to inform repack algorithms.

* Store a stable object order, allowing the reachability bitmap to be 
computed at a different cadence from repacking the packfiles.

I'm interested in these applications, but I will admit that they are not
at the top of my priority list at the moment. Right now, I'm focused on
reaching feature parity with the version of the MIDX we have in our GVFS 
fork of Git, and then extending the feature to have incremental 
multi-pack-index files to solve the "big write" problem.

Thanks,

-Stolee

[1] https://public-inbox.org/git/20180608224136.20220-1-avarab@gmail.com/T/#u
    [PATCH 00/20] unconditional O(1) SHA-1 abbreviation

