* filter-branch IO optimization
From: Enrico Weigelt @ 2012-10-11 15:39 UTC (permalink / raw)
  To: git list

Hi folks,

for certain projects, I need to regularly run filter-branch on quite
large repos (>10k commits), and it has to be run multiple times,
which takes several hours, so I'm looking for optimizations.

The main goal of this filtering is splitting out many modules from a
large upstream repo into their own downstream repos. This process
should be fully deterministic (IOW: running it twice on the same input
should produce exactly the same output, so commit IDs stay the same
across subsequent runs).

My current approach is most likely still a bit too naive:

#1: fork off a new branch from current upstream
#2: run a tree-filter which:
    * removes all files not belonging to the wanted module
    * moves the module directory under another subdir (./addons/)
    * fixes author/committer name/email if empty (because otherwise it fails)
    * fixes character sets and indentation of source files
#3: loop through `git filter-branch --prune-empty` to get rid of empty
    merge nodes (of which a great many otherwise remain), until the
    branch remains unchanged
#4: run a plain rebase onto the initial commit to linearize the history

All of that is done on a per-module basis (for now only about 10 modules,
but soon there may be many more).

One thing I haven't tried yet is using the -d option to move the .git-rewrite
dir to a tmpfs (I have to clarify some operational considerations first) ;-o
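
A minimal sketch of the -d idea, assuming a Linux machine where /dev/shm
is a tmpfs mount (that is an assumption; adjust the path otherwise).
pick_rewrite_dir is a hypothetical helper, not part of git; -d <directory>
is the documented filter-branch option for the temporary directory.

```shell
# Hypothetical helper: choose a scratch path for filter-branch's
# temporary data. Prefers /dev/shm (tmpfs on most Linux systems, which
# is an assumption here) and falls back to the usual temp location.
pick_rewrite_dir() {
    if [ -d /dev/shm ] && [ -w /dev/shm ]; then
        base=/dev/shm
    else
        base=${TMPDIR:-/tmp}
    fi
    # filter-branch refuses to start if the directory already exists
    # (unless -f is given), so only print a path and do not create it.
    printf '%s/git-rewrite.%s\n' "$base" "$$"
}

# Usage (not executed here):
#   git filter-branch -d "$(pick_rewrite_dir)" --prune-empty -- HEAD
```

Whether this helps depends on how much of the run time is really spent on
temporary files; a tree-filter checks out every commit there, so it should
benefit more than an index-filter.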

The next step I have in mind is using --subdirectory-filter, but open
questions are:

* does it suffer from the same problems w/ empty username/email as --tree-filter ?
** if yes: what can I do about it (have an additional pass for fixing that before
   running the --tree-filter) ?
* can I somehow teach the --subdirectory-filter to place the result under some
  subdirectory instead of directly in the root ?
* can I use --tree-filter in combination with --subdirectory-filter ?
  which one is executed first ?


thanks
-- 
Mit freundlichen Grüßen / Kind regards 

Enrico Weigelt 
VNC - Virtual Network Consult GmbH 
Head Of Development 

Pariser Platz 4a, D-10117 Berlin
Tel.: +49 (30) 3464615-20
Fax: +49 (30) 3464615-59

enrico.weigelt@vnc.biz; www.vnc.de 


* Re: filter-branch IO optimization
From: Johannes Sixt @ 2012-10-11 18:36 UTC (permalink / raw)
  To: Enrico Weigelt; +Cc: git list

On 11.10.2012 17:39, Enrico Weigelt wrote:
> The main goal of this filtering is splitting out many modules from a
> large upstream repo into their own downstream repos.
...
> The next step I have in mind is using --subdirectory-filter, but open
> questions are:
> 
> * does it suffer from the same problems w/ empty username/email like --tree-filter ?

I think so.

> ** if yes: what can I do about it (have an additional pass for fixing that before
>    running the --tree-filter ?

Use --env-filter.

> * can I somehow teach the --subdirectory filter to place the result under some
>   somedir instead of directly to root ?

No, but see the last example in the man page.

> * can I use --tree-filter in combination with --subdirectory-filter ?
>   which one is executed first ?

Yes. --subdirectory-filter applies first.
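
A minimal sketch of what such an --env-filter could look like, assuming
the goal is simply to fill in empty identities. The function name
fix_identity and the fallback values are hypothetical; only the GIT_*
variable names come from git itself.

```shell
# Sketch of an --env-filter body: fill in empty author/committer
# identities. The fallback values are placeholders (assumptions) and
# can be overridden via FALLBACK_NAME / FALLBACK_EMAIL.
fix_identity() {
    fallback_name=${FALLBACK_NAME:-nobody}
    fallback_email=${FALLBACK_EMAIL:-nobody@example.invalid}

    # Prefer the other role's value, then the fallback.
    GIT_AUTHOR_NAME=${GIT_AUTHOR_NAME:-${GIT_COMMITTER_NAME:-$fallback_name}}
    GIT_COMMITTER_NAME=${GIT_COMMITTER_NAME:-$GIT_AUTHOR_NAME}
    GIT_AUTHOR_EMAIL=${GIT_AUTHOR_EMAIL:-${GIT_COMMITTER_EMAIL:-$fallback_email}}
    GIT_COMMITTER_EMAIL=${GIT_COMMITTER_EMAIL:-$GIT_AUTHOR_EMAIL}

    export GIT_AUTHOR_NAME GIT_AUTHOR_EMAIL
    export GIT_COMMITTER_NAME GIT_COMMITTER_EMAIL
}

# Usage (not executed here), with the definition above saved in a
# hypothetical file fix-identity.sh:
#   git filter-branch \
#       --env-filter '. /path/to/fix-identity.sh; fix_identity' -- --all
```

The filters run in a temporary directory, so the script should be
referenced by an absolute path.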

-- Hannes


* Re: filter-branch IO optimization
From: Thomas Rast @ 2012-10-11 20:34 UTC (permalink / raw)
  To: Enrico Weigelt; +Cc: git list

Enrico Weigelt <enrico.weigelt@vnc.biz> writes:

> for certain projects, I need to regularly run filter-branch on quite
> large repos (>10k commits), and that needs to be run multiple times,
> which takes several hours, so I'm looking for optimizations.
[...]
> #2: run a tree-filter which:
>     * removes all files not belonging to the wanted module
>     * moves the module directory under another subdir (./addons/)
>     * fixes author/committer name/email if empty (because otherwise it fails)

The usual advice is "use an index-filter instead".  It's *much* faster
than a tree filter.  However:

>     * fixes character sets and indentation of source files

That last step is rather crazy.  At the very least you will want to only
operate on files that were changed since the parent commit, so as to
avoid scanning the whole tree.  If you do this right, it should also fit
into an index-filter.
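
A rough sketch of the "only files changed since the parent commit" part,
assuming git diff-tree is used to find them. changed_paths and
reformat_blob are hypothetical names. Note that files a commit does not
touch still come from the original (unconverted) commit, so a real filter
would also have to remember already converted blobs; that part is not
shown here.

```shell
# List paths added or modified by one commit relative to its parent.
# --root makes the initial commit list all of its files. For merge
# commits plain diff-tree prints nothing, so merges are skipped here.
changed_paths() {
    git diff-tree -r --root --no-commit-id --name-only \
        --diff-filter=AM "$1"
}

# Usage inside a filter (not executed here); reformat_blob stands for
# a hypothetical per-file fixer:
#   changed_paths "$GIT_COMMIT" | while IFS= read -r path; do
#       reformat_blob "$path"
#   done
```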

-- 
Thomas Rast
trast@{inf,student}.ethz.ch


* Re: filter-branch IO optimization
From: Enrico Weigelt @ 2012-10-12 14:49 UTC (permalink / raw)
  To: Thomas Rast; +Cc: git list

Hi,

> The usual advice is "use an index-filter instead".  It's *much*
> faster
> than a tree filter.  However:

I've tried the last example from the git-filter-branch manpage, but it failed.
It seems the GIT_INDEX_FILE env variable doesn't get honoured by
git-update-index: no index.new file is created, and so the mv call fails.

My second try (as index-filter command) was:

git ls-files -s > ../_INDEX_TMP
cat ../_INDEX_TMP |
    sed "s-\t\"*-&addons/-" |
    git update-index --index-info
rm -f ../_INDEX_TMP

It works fine in the worktree (I see the files renamed in the index),
but has no effect when run as --index-filter. It seems the index
file isn't used at all (or some completely different one is).

By the way, inside the index filter, GIT_INDEX_FILE is

/home/devel/vnc/openerp/workspace/pkg/openerp-extra-bundle.git/.git-rewrite/t/../index

That is obviously a different (temporary) index file, while many examples
on the web suggest using commands like 'git add --cached' or
'git rm --cached' _without_ passing the GIT_INDEX_FILE variable.

Could there be some bug where this variable isn't honored properly
everywhere ?
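
To see what the sed step does in isolation, here is a small sketch that
runs it on a hand-written sample line instead of a real index. The
function name prefix_paths and the sample path are made up; a literal tab
is used instead of \t so it does not depend on GNU sed.

```shell
# Prefix every path in `git ls-files -s` style input with a directory.
# Input lines look like "<mode> <sha> <stage><TAB><path>"; the prefix is
# inserted right after the tab (and after an opening quote, if git
# quoted the path). The prefix must not contain "-", which is used as
# the sed delimiter here.
prefix_paths() {
    tab=$(printf '\t')
    sed "s-${tab}\"*-&$1/-"
}

# Demo on a sample line (the blob id is that of the empty blob).
printf '100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0\tmodule_a/__init__.py\n' |
    prefix_paths addons
```

Feeding such output back into the same index with update-index
--index-info adds the addons/ entries but leaves the old entries in
place, which is presumably why the manpage example writes into a fresh
index file and then moves it over the old one.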



* Re: filter-branch IO optimization
From: Enrico Weigelt @ 2012-10-12 15:59 UTC (permalink / raw)
  To: Thomas Rast; +Cc: git list

<snip>

Did some more experiments, and it seems that a missing index file
isn't automatically created.

When I instead copy the original index file to the temporary
location, it runs well. But I still have to wait for the final
result to check whether it really overwrites the whole index
or just adds new files.


cu


* Re: filter-branch IO optimization
From: Enrico Weigelt @ 2012-10-12 17:20 UTC (permalink / raw)
  To: Thomas Rast; +Cc: git list

Hi folks,

I've now finally managed the index-filter part.
The main problem, IIRC, was that git-update-index didn't
automatically create an empty index, so I needed to explicitly
copy one in (I created it manually from an empty repo).

My current filter code is:

# Fill in a missing e-mail address from the other role, or use a placeholder.
if [ ! "$GIT_AUTHOR_EMAIL" ] && [ ! "$GIT_COMMITTER_EMAIL" ]; then
	export GIT_AUTHOR_EMAIL="nobody@none.org"
	export GIT_COMMITTER_EMAIL="nobody@none.org"
elif [ ! "$GIT_AUTHOR_EMAIL" ]; then
	export GIT_AUTHOR_EMAIL="$GIT_COMMITTER_EMAIL"
elif [ ! "$GIT_COMMITTER_EMAIL" ]; then
	export GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"
fi

# Same for the names.
if [ ! "$GIT_AUTHOR_NAME" ] && [ ! "$GIT_COMMITTER_NAME" ]; then
	export GIT_AUTHOR_NAME="nobody@none.org"
	export GIT_COMMITTER_NAME="nobody@none.org"
elif [ ! "$GIT_AUTHOR_NAME" ]; then
	export GIT_AUTHOR_NAME="$GIT_COMMITTER_NAME"
elif [ ! "$GIT_COMMITTER_NAME" ]; then
	export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME"
fi

# Start from an empty index, then add only the relocated module entries.
cp ../../../../scripts/index.empty "$GIT_INDEX_FILE.new"

git ls-files -s |
    sed "s-\t\"*-&addons/-" |
    grep -e "\t*addons/$module" |
    ( export GIT_INDEX_FILE="$GIT_INDEX_FILE.new" ; git update-index --index-info )

mv "$GIT_INDEX_FILE.new" "$GIT_INDEX_FILE"


Now another problem: this leaves behind thousands of now-empty
merge nodes (--prune-empty doesn't seem to catch them all),
so I loop through additional `git filter-branch --prune-empty`
runs until the ref remains unchanged.

This process is even more time-consuming, as it takes a great many
passes (I haven't counted them yet).

Does anyone have an idea why a single run doesn't catch them all?
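
For reference, the git-filter-branch documentation says that
--prune-empty only removes commits that have exactly one or zero
non-pruned parents, so merge commits are left intact; that may be why
one pass is not enough here. The "repeat until the ref is unchanged" loop
itself can be sketched as below. repeat_until_stable is a hypothetical
helper, not a git command.

```shell
# Hypothetical helper: run a step command repeatedly until a state
# command prints the same value before and after the step.
# Prints the number of passes that were needed.
repeat_until_stable() {
    state_cmd=$1
    step_cmd=$2
    passes=0
    while :; do
        before=$(eval "$state_cmd")
        # Send the step's own output to stderr so that stdout only
        # carries the pass count.
        eval "$step_cmd" >&2 || return 1
        passes=$((passes + 1))
        after=$(eval "$state_cmd")
        [ "$before" = "$after" ] && break
    done
    echo "$passes"
}

# Usage (not executed here):
#   repeat_until_stable 'git rev-parse HEAD' \
#       'git filter-branch -f --prune-empty -- HEAD'
```

The last pass is always a no-op that only confirms nothing changed, so
the loop costs one full run more than strictly necessary.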


cu


* Re: filter-branch IO optimization
From: Jeff King @ 2012-10-12 17:20 UTC (permalink / raw)
  To: Enrico Weigelt; +Cc: Thomas Rast, git list

On Fri, Oct 12, 2012 at 04:49:54PM +0200, Enrico Weigelt wrote:

> > The usual advice is "use an index-filter instead".  It's *much*
> > faster
> > than a tree filter.  However:
> 
> I've tried the last example from the git-filter-branch manpage, but it failed.
> It seems the GIT_INDEX_FILE env variable doesn't get honoured by
> git-update-index: no index.new file is created, and so the mv call fails.
> 
> My second try (as index-filter command) was:
> 
> git ls-files -s > ../_INDEX_TMP
> cat ../_INDEX_TMP |
>     sed "s-\t\"*-&addons/-" |
>     git update-index --index-info
> rm -f ../_INDEX_TMP

I didn't look closely at your individual problem, but that example has
proven flaky before.  There were some simpler formulations given in this
thread:

  http://thread.gmane.org/gmane.comp.version-control.git/195492

In particular, Junio suggested:

  git filter-branch --index-filter '
    rm -f "$GIT_INDEX_FILE"
    git read-tree --prefix=newsubdir/ "$GIT_COMMIT"
  ' HEAD
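
A possible adaptation of that idea to the "one module under addons/"
case, as a sketch only. It assumes each module is a top-level directory
in the upstream tree; relocate_module is a hypothetical name, and the
<commit>:<path> syntax is used to pick just the module's subtree.

```shell
# Sketch of an --index-filter body: replace the index with only the
# subtree of $GIT_COMMIT at directory $1, placed under addons/$1/.
# Commits that do not contain the directory end up with an empty index.
relocate_module() {
    rm -f "$GIT_INDEX_FILE"
    if git rev-parse -q --verify "$GIT_COMMIT:$1" >/dev/null 2>&1; then
        git read-tree --prefix="addons/$1/" "$GIT_COMMIT:$1"
    fi
}

# Usage (not executed here), with the definition above saved in a
# hypothetical file relocate-module.sh and a module called my_module:
#   git filter-branch --prune-empty --index-filter \
#       '. /path/to/relocate-module.sh; relocate_module my_module' -- HEAD
```

This avoids both the sed rewriting and the need for a pre-made empty
index file, since read-tree starts from a missing (empty) index.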

-Peff
