From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Cc: Richard Hipp <drh@sqlite.org>,
	git@vger.kernel.org, Mike Hommey <mh@glandium.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: git-fast-import yields huge packfile
Date: Thu, 21 Mar 2019 15:23:15 +0100
Message-ID: <87o964cnn0.fsf@evledraar.gmail.com>
In-Reply-To: <nycvar.QRO.7.76.6.1903211503030.41@tvgsbejvaqbjf.bet>


On Thu, Mar 21 2019, Johannes Schindelin wrote:

> Hi Richard,
>
> On Sat, 16 Mar 2019, Richard Hipp wrote:
>
>> I'm trying to transform a repository from another VCS into a Git
>> repository using "git fast-import".  It appears to work, but the
>> resulting Git repository is huge relative to the original - 18 times
>> larger. Most of the space seems to be taken up by a single large
>> packfile.  That packfile is about 967 MB which is about 1/4th the
>> total uncompressed size of all 41785 distinct Blobs in the original
>> repository.  The source VCS is able to compress this down to 52 MB by
>> comparison.
>
> I feel your pain, as I had the same problem back in the day. My use case
> was mirroring an upstream Mercurial repository to a Git repository. This
> use case went away, so I do not do that anymore (and there are more, less
> happy reasons why I would no longer work on that git-remote-hg project,
> but that's off topic). As one of the last rem(a)inders, Git for Windows
> carries this patch:
>
> https://github.com/git-for-windows/git/commit/b91911ff8d3e2cf279b4708be89de2e3bc8e9e87
>
> Essentially, it *always* runs `git gc --auto` after running `fast-import`.
>
> Which is a lot more high-level advice than the rather low-level `git
> repack` hint given elsewhere in this thread.
>
> Now, I wonder whether we should integrate this into `fast-import` proper
> (with a knob to turn it off), maybe even offer to run `git gc --auto`
> every <N> imported commits?

My reading of the combination of Linus's & Mike Hommey's E-Mails is that
this just happened to work for you because the blob import order you
used was such that you didn't get any on-the-fly deltas.

But as Linus notes, you need to pass "-f" to "git repack" (which hands
"--no-reuse-delta" down to pack-objects) for this to work in the general
case, so a plain "git gc" in that GFW patch won't do the right thing
*unless* you didn't end up with any deltas at all (or close enough for
it not to matter).

So in the general case you need to run "git gc --aggressive" after a
"fast-import". I'll add some docs about this in my re-roll of my
concurrent gc doc series:
https://public-inbox.org/git/20190318161502.7979-1-avarab@gmail.com/
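
To make that concrete, here is a minimal sketch. The helper name and its
two arguments (a repository path and a stream file) are made up for
illustration; only the git commands themselves are real:

```shell
# Hypothetical helper: import a fast-import stream, then recompute deltas.
#   $1 = path to the target repository (assumed to exist already)
#   $2 = file containing the fast-import stream
import_and_gc() {
    repo=$1
    stream=$2
    # Feed the stream to fast-import; only run gc if the import succeeded.
    git -C "$repo" fast-import <"$stream" &&
    # --aggressive makes gc repack with -f, so existing deltas are not reused.
    git -C "$repo" gc --aggressive
}
```

It would be called as e.g. "import_and_gc /path/to/repo stream.fi", with
both paths being placeholders.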

I wonder if we should just leave it at that. The fast-import command is
plumbing, and e.g. someone who runs N of those today and then does a
"git gc --aggressive" afterwards would have their workflow broken by
this: their "gc" would abort if the "--aggressive" we spawned after the
first fast-import invocation was still running.

I was thinking of introducing some sub-mode for --aggressive that
doesn't tweak the window size, but just passes down "-f". It would cover
these cases more generally, and eat less CPU than the increased window
size (although "--no-reuse-delta" by itself is very expensive).
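
Until such a sub-mode exists, something close to it can probably be
approximated by calling "git repack" directly, which leaves the window
and depth at their configured defaults. A sketch, where the function
name is hypothetical and the repository path is a placeholder:

```shell
# Hypothetical helper: recompute all deltas without raising window/depth.
#   $1 = path to the repository
repack_no_reuse() {
    repo=$1
    # -a: put all reachable objects into a single new pack
    # -d: delete the packs made redundant by the new one
    # -f: pass --no-reuse-delta to pack-objects
    git -C "$repo" repack -a -d -f
}
```

Note that this only repacks; it skips the other housekeeping that "git
gc" performs.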


>> Maybe I'm doing something wrong with the fast-import stream that is
>> defeating Git's attempts at delta compression....
>>
>> Are there any utility programs available for analyzing packfiles so
>> that I try to figure out where the inefficiencies are cropping up, so
>> that I can try to address them?
>>
>> Anybody have any suggestions on what I should be looking for?
>>
>> If anyone would care to see this oversized packfile and perhaps offer
>> suggestions on how I can make it more space-efficient, it can be
>> cloned from https://github.com/drhsqlite/fossil-mirror.git - at least
>> for now - surely I will delete that repo and regenerate it once I
>> figure out this problem.
>>
>> --
>> D. Richard Hipp
>> drh@sqlite.org
>>
