From: Ian Campbell <ijc@hellion.org.uk>
To: gitster@pobox.com
Cc: git@vger.kernel.org
Subject: Re: [PATCH v2 4/4] Subject: filter-branch: stash away ref map in a branch
Date: Sun, 17 Sep 2017 10:43:23 +0100
Message-ID: <1505641403.22447.6.camel@hellion.org.uk>
In-Reply-To: <20170917073657.31193-4-ijc@hellion.org.uk>

On Sun, 2017-09-17 at 08:36 +0100, Ian Campbell wrote:
> +if test -n "$state_branch"
> +then
> +	echo "Saving rewrite state to $state_branch" 1>&2
> +	state_blob=$(
> +		perl -e'opendir D, "../map" or die;
> +			open H, "|-", "git hash-object -w --stdin" or die;
> +			foreach (sort readdir(D)) {
> +				next if m/^\.\.?$/;
> +				open F, "<../map/$_" or die;
> +				chomp($f = <F>);
> +				print H "$_:$f\n" or die;
> +			}
> +			close(H) or die;' || die "Unable to save state")

One thing I've noticed is that for a full Linux tree history the
filter.map file is 50M+, which causes GitHub to complain:

    remote: warning: File filter.map is 54.40 MB; this is larger than GitHub's recommended maximum file size of 50.00 MB

(you can simulate this with `git log --pretty=format:"%H:%H"
upstream/master`.) I suppose that's not a bad recommendation for any
infra, not just GH's.

The blob is compressed in the object store so there isn't _much_ point
in compressing the map (also, it only goes down to ~30MB anyway so we
aren't buying all that much time), but I'm wondering if perhaps I
should look into a more intelligent representation, perhaps hashed by
the first two characters (as .git/objects is) to divide into several
blobs and have two levels.
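
For illustration, the two-level split could be sketched as below. This
is only a sketch under stated assumptions: shard_map is a hypothetical
helper that is not part of the patch, and it assumes the flat map holds
one "old:new" pair per line, as written by the perl snippet quoted
above. Each resulting bucket file could then be hashed into its own
blob.

```shell
# Hypothetical helper (not part of the patch): split a flat "old:new"
# map into buckets named after the first two characters of the old id.
# Each bucket line keeps the remainder of the old id and the new id.
shard_map () {
	map_file=$1
	out_dir=$2
	mkdir -p "$out_dir" || return 1
	while IFS=: read -r old new
	do
		# Skip lines whose key is too short to have a 2-char prefix.
		case $old in
		??*) ;;
		*) continue ;;
		esac
		rest=${old#??}          # key without its first two characters
		prefix=${old%"$rest"}   # the first two characters
		printf '%s:%s\n' "$rest" "$new" >>"$out_dir/$prefix" || return 1
	done <"$map_file"
}
```

With up to 256 buckets, a 54 MB map would average roughly 0.2 MB per
bucket, assuming the ids are spread evenly across prefixes. The
line-at-a-time append is written for clarity, not speed.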

I'm also wondering if the .git-rewrite/map directory, which will have
70k+ (and growing) directory entries for a modern Linux tree, would
benefit from the same sort of thing. OTOH in this case the extra shell
machinations to turn abcdef123 into ab/cdef123 might overwhelm the
savings in directory lookup time (unless there is a helper already for
that). That assumes that directory lookup is even a bottleneck; I've not
measured, but anecdotally/gut-feeling the commits-per-second does seem
to be decreasing over the course of the filtering process.
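
As a rough sketch of how cheap the abcdef123 -> ab/cdef123 step could
be, POSIX parameter expansion can do the split without running any
external command. shard_path is a hypothetical name used only for this
illustration, and the sketch assumes the id is at least two characters
long.

```shell
# Hypothetical helper: print the two-level path for an object id,
# e.g. abcdef123 -> ab/cdef123, using only POSIX parameter expansion.
shard_path () {
	rest=${1#??}                              # id minus its first two characters
	printf '%s/%s\n' "${1%"$rest"}" "$rest"   # first two characters, "/", remainder
}
```

Calling it as $(shard_path "$sha") still costs a subshell for the
command substitution, so in a hot loop the two expansions could be
inlined instead. Whether that changes the overall timing would need
measuring.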

Ian.

Thread overview: 8+ messages
2017-09-17  7:36 [PATCH v2 0/4] filter-branch: support for incremental update + fix for ancient tag format Ian Campbell
2017-09-17  7:36 ` [PATCH v2 1/4] mktag: add option which allows the tagger field to be omitted Ian Campbell
2017-09-19  3:01   ` Junio C Hamano
2017-09-19  6:42     ` Ian Campbell
2017-09-17  7:36 ` [PATCH v2 2/4] filter-branch: reset $GIT_* before cleaning up Ian Campbell
2017-09-17  7:36 ` [PATCH v2 3/4] filter-branch: preserve and restore $GIT_AUTHOR_* and $GIT_COMMITTER_* Ian Campbell
2017-09-17  7:36 ` [PATCH v2 4/4] Subject: filter-branch: stash away ref map in a branch Ian Campbell
2017-09-17  9:43   ` Ian Campbell [this message]
