From: Enrico Weigelt <enrico.weigelt@vnc.biz>
To: git list <git@vger.kernel.org>
Subject: filter-branch IO optimization
Date: Thu, 11 Oct 2012 17:39:47 +0200 (CEST)
Message-ID: <fa1e05a5-54b3-47ff-bd28-dc463ebbc4bd@zcs>
In-Reply-To: <7e000a0f-9e4e-4a4d-a8ce-5d017e17939c@zcs>

Hi folks,

for certain projects, I need to regularly run filter-branch on quite
large repos (>10k commits), and it has to be run multiple times,
which takes several hours, so I'm looking for optimizations.

The main goal of this filtering is splitting out many modules from a
large upstream repo into their own downstream repos. This process
should be fully deterministic (IOW: running it twice on the same input
should produce exactly the same output, so commit IDs stay the same
across subsequent runs).

My current approach is most likely still a bit too naive:

#1: fork off a new branch from current upstream
#2: run a tree-filter which:
    * removes all files not belonging to the wanted module
    * moves the module directory under another subdir (./addons/)
    * fixes author/committer name/email if empty (because otherwise it fails)
    * fixes character sets and indentation of source files
#3: loop through `git filter-branch --prune-empty` to get rid of empty
    merge nodes (of which a lot remain otherwise), until the branch
    remains unchanged
#4: run a plain rebase onto the initial commit to linearize the history
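
A minimal sketch of steps #1 to #3 on a throwaway repository, not the
actual scripts described here. The module names mod_a and mod_b, the
branch name split-mod_a and the demo identity are made up for
illustration; the identity, charset and indentation fixes of step #2
and the rebase of step #4 are left out.

```shell
#!/bin/sh
# Sketch only: builds a scratch repo so the commands can be tried safely.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1   # quiets the notice printed by newer git versions

work=$(mktemp -d)
cd "$work"
git init -q .
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.org"

# Two hypothetical modules living in one upstream tree.
mkdir -p mod_a mod_b
echo a1 > mod_a/file.txt
echo b1 > mod_b/file.txt
git add . && git commit -q -m "add both modules"
echo b2 > mod_b/file.txt
git commit -q -am "touch only mod_b"
echo a2 > mod_a/file.txt
git commit -q -am "touch only mod_a"

# Step #1: fork off a new branch from current upstream.
git checkout -q -b split-mod_a

# Step #2: tree-filter that keeps only mod_a and moves it under ./addons/.
# Note: the glob does not match dotfiles; a real filter may need to handle them.
git filter-branch -f --tree-filter '
    mkdir -p addons
    for entry in * ; do
        case "$entry" in
            mod_a)  mv "$entry" addons/ ;;
            addons) ;;
            *)      rm -rf "$entry" ;;
        esac
    done
' -- split-mod_a >/dev/null 2>&1

# Step #3: repeat --prune-empty until the branch tip no longer changes.
while : ; do
    before=$(git rev-parse split-mod_a)
    git filter-branch -f --prune-empty -- split-mod_a >/dev/null 2>&1
    [ "$before" = "$(git rev-parse split-mod_a)" ] && break
done

commits_left=$(git rev-list --count split-mod_a)
files_left=$(git ls-tree -r --name-only split-mod_a)
```

In this toy history the commit that only touched mod_b becomes empty
after step #2 and is dropped by the first pruning pass; the second pass
changes nothing, which ends the loop.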

All of that is done on a per-module basis (for now only about 10
modules, but soon there may be many more).

One thing I haven't tried yet is using the -d option to move the .git-rewrite
dir to a tmpfs (I have to clarify some operational considerations first) ;-o
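
For illustration, a hedged sketch of what the -d variant could look
like. It assumes /dev/shm is a tmpfs mount (typical on Linux, but not
guaranteed) and falls back to TMPDIR otherwise; the scratch repo and
all names in it are made up. Whether this actually helps depends on
how much of the run time is spent on work-tree I/O.

```shell
#!/bin/sh
# Sketch only: runs filter-branch with its temporary directory on a RAM-backed path.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1   # quiets the notice printed by newer git versions

# Assumption: /dev/shm is tmpfs; otherwise fall back to the usual temp dir.
if [ -d /dev/shm ] && [ -w /dev/shm ]; then
    scratch_base=/dev/shm
else
    scratch_base=${TMPDIR:-/tmp}
fi
scratch=$(mktemp -d "$scratch_base/fb-scratch.XXXXXX")

# Throwaway repo with one empty commit in the middle.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.org"
echo one > file.txt
git add file.txt && git commit -q -m "first"
git commit -q --allow-empty -m "empty commit"
echo two > file.txt
git commit -q -am "second"

# -d names the temporary rewrite directory; filter-branch refuses to start
# if that directory already exists, so point it at a fresh path below $scratch.
git filter-branch -f -d "$scratch/rewrite" --prune-empty -- HEAD >/dev/null 2>&1

commits_after=$(git rev-list --count HEAD)
rm -rf "$scratch"
```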

The next step I have in mind is using --subdirectory-filter, but open
questions are:

* does it suffer from the same problems w/ empty username/email as --tree-filter?
** if yes: what can I do about it (have an additional pass for fixing that before
   running the --tree-filter)?
* can I somehow teach the --subdirectory-filter to place the result under some
  subdir instead of directly in the root?
* can I use --tree-filter in combination with --subdirectory-filter?
  Which one is executed first?
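
To make the questions concrete, here is a sketch of the combination
being asked about, tried on a scratch repo. It assumes that
filter-branch applies --subdirectory-filter before the other filters,
that an --env-filter may fill in empty identity variables, and that
sed is GNU sed (for \t); the index-filter is adapted from the "move the
whole tree into a subdirectory" example in the git-filter-branch
manual. The module names and the "unknown" placeholders are made up.

```shell
#!/bin/sh
# Sketch only: subdirectory-filter + env-filter + index-filter in one run.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1   # quiets the notice printed by newer git versions

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.org"
mkdir -p mod_a mod_b
echo a1 > mod_a/file.txt
echo b1 > mod_b/file.txt
git add . && git commit -q -m "add both modules"
echo b2 > mod_b/file.txt
git commit -q -am "touch only mod_b"
echo a2 > mod_a/file.txt
git commit -q -am "touch only mod_a"

git filter-branch -f \
    --subdirectory-filter mod_a \
    --env-filter '
        # Fill in placeholders only where a value is unset or empty.
        : "${GIT_AUTHOR_NAME:=unknown}"
        : "${GIT_AUTHOR_EMAIL:=unknown@example.org}"
        : "${GIT_COMMITTER_NAME:=unknown}"
        : "${GIT_COMMITTER_EMAIL:=unknown@example.org}"
        export GIT_AUTHOR_NAME GIT_AUTHOR_EMAIL
        export GIT_COMMITTER_NAME GIT_COMMITTER_EMAIL
    ' \
    --index-filter '
        # Prefix every path in the index with addons/mod_a/ (GNU sed assumed).
        git ls-files -s | sed "s-\t\"*-&addons/mod_a/-" |
            GIT_INDEX_FILE=$GIT_INDEX_FILE.new git update-index --index-info &&
        mv "$GIT_INDEX_FILE.new" "$GIT_INDEX_FILE"
    ' -- HEAD >/dev/null 2>&1

commits_left=$(git rev-list --count HEAD)
files_left=$(git ls-tree -r --name-only HEAD)
```

Since the index-filter works on the index only, no files need to be
checked out per commit, which is usually where a tree-filter loses
most of its time.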


thanks
-- 
Mit freundlichen Grüßen / Kind regards 

Enrico Weigelt 
VNC - Virtual Network Consult GmbH 
Head Of Development 

Pariser Platz 4a, D-10117 Berlin
Tel.: +49 (30) 3464615-20
Fax: +49 (30) 3464615-59

enrico.weigelt@vnc.biz; www.vnc.de 

