git.vger.kernel.org archive mirror
* git filter-branch --subdirectory-filter
@ 2008-05-09  1:01 James Sadler
  2008-05-09  1:33 ` Jeff King
  0 siblings, 1 reply; 10+ messages in thread
From: James Sadler @ 2008-05-09  1:01 UTC (permalink / raw)
  To: git

Hi All,

I have some issues with git filter-branch.

I have a git repository that I wish to split into multiple separate
repositories for each logical
module that it contains. Each logical module is already in its own
directory at the root of the repo.

My experiments with 'git filter-branch' have been *partially* successful.

To extract a module into its own repo, I first copied the original
repo (this was a simple cp -r,
as it seemed to be the simplest way as git clone doesn't get all the branches)
and ran filter-branch with a --commit-filter to skip commits that were
irrelevant to the subdir.

That step worked just fine.

The next pass was to 'hoist' the contents of the subdir in the new
repo into the root dir.
I thought I could do this with a --subdirectory-filter argument to
filter-branch, except when I do
this, I lose tons of commits.  (The working tree is correct, i.e. the
same as the original repo
working tree, but the history is screwed).

Anybody have any idea what I am doing wrong?  If it can't be done with
--subdirectory-filter can
it be done with the 'subtree' merge strategy somehow?

Cheers,
-- 
James


* Re: git filter-branch --subdirectory-filter
  2008-05-09  1:01 git filter-branch --subdirectory-filter James Sadler
@ 2008-05-09  1:33 ` Jeff King
  2008-05-09  7:38   ` James Sadler
  0 siblings, 1 reply; 10+ messages in thread
From: Jeff King @ 2008-05-09  1:33 UTC (permalink / raw)
  To: James Sadler; +Cc: git

On Fri, May 09, 2008 at 11:01:47AM +1000, James Sadler wrote:

> I have a git repository that I wish to split into multiple separate
> repositories for each logical module that it contains. Each logical
> module is already in its own directory at the root of the repo.

OK.

> To extract a module into its own repo, I first copied the original
> repo (this was a simple cp -r, as it seemed to be the simplest way as
> git clone doesn't get all the branches)

It does copy them, but they're just "remote tracking branches". If you
have many branches, you can recreate them via a loop with git-branch, or
by "git fetch . refs/remotes/origin/*:refs/heads/*". If you have only
one branch, you might just want to make a few copies of it with "for i
in repo1 repo2; do git branch $i master; done", and then filter-branch
those branches.

In either case, your cp is fine, if just a little less efficient.
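
A minimal sketch of the loop approach, assuming a plain clone whose remote
is named "origin". The repository names (src, copy) and branch names
(topic-a, topic-b) are made up for illustration:

```shell
#!/bin/sh
set -e
# Throwaway identity so the demo commits work on any machine.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

cd "$(mktemp -d)"

# Hypothetical source repository with two extra branches.
git init -q src
cd src
echo content >file
git add file
git commit -q -m initial
git branch topic-a
git branch topic-b
cd ..

# A plain clone creates one local branch; the others only exist as
# remote-tracking branches under refs/remotes/origin/.
git clone -q src copy
cd copy

# Create a local branch for every remote-tracking branch that lacks one.
git for-each-ref --format='%(refname)' refs/remotes/origin |
while read -r ref; do
  name=${ref#refs/remotes/origin/}
  if [ "$name" = HEAD ]; then
    continue            # origin/HEAD is a pointer, not a real branch
  fi
  git show-ref --verify --quiet "refs/heads/$name" ||
    git branch -q "$name" "$ref"
done

local_branches=$(git for-each-ref --format='%(refname)' refs/heads | wc -l | tr -d ' ')
echo "local branches after the loop: $local_branches"
```

After the loop, every branch of the source repository should be a local
branch in the clone and can be handed to filter-branch.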

> and ran filter-branch with a --commit-filter to skip commits that were
> irrelevant to the subdir.

But that's part of what subdirectory-filter does, so this step is
unnecessary.

> The next pass was to 'hoist' the contents of the subdir in the new
> repo into the root dir.

And that's the other part of what subdirectory-filter does.

> I thought I could do this with a --subdirectory-filter argument to
> filter-branch, except when I do this, I lose tons of commits.  (The
> working tree is correct, i.e. the same as the original repo working
> tree, but the history is screwed).

You'll have to be more specific about what's wrong with history. Of
course some commits will be gone after filtering the subdir (those that
didn't touch anything in the subdir); that's part of the point.

-Peff


* Re: git filter-branch --subdirectory-filter
  2008-05-09  1:33 ` Jeff King
@ 2008-05-09  7:38   ` James Sadler
  2008-05-09  7:57     ` Johannes Sixt
  2008-05-09  8:00     ` Jeff King
  0 siblings, 2 replies; 10+ messages in thread
From: James Sadler @ 2008-05-09  7:38 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Hi Jeff,

After reading your response and re-reading my original email, I
realised it was totally unclear
so I have re-explained myself below.

2008/5/9 Jeff King <peff@peff.net>:
> On Fri, May 09, 2008 at 11:01:47AM +1000, James Sadler wrote:
>
>> I have a git repository that I wish to split into multiple separate
>> repositories for each logical module that it contains. Each logical
>> module is already in its own directory at the root of the repo.
>
> OK.
>
>> To extract a module into its own repo, I first copied the original
>> repo (this was a simple cp -r, as it seemed to be the simplest way as
>> git clone doesn't get all the branches)

I must have experienced a brain fart or something or missed the '-r' from
git branch...

>> and ran filter-branch with a --commit-filter to skip commits that were
>> irrelevant to the subdir.
>
> But that's part of what subdirectory-filter does, so this step is
> unnecessary.

Yes that's true, but...

Clearer explanation:

I originally tried --subdirectory-filter by itself to see if it would
do the job, but it filtered
more commits than I thought it should (some commits that touched the subdir were
missing after filter-branch was run).

I then began to question my understanding of the semantics of
subdirectory-filter.

Is it meant to:
A) Only keep commits where ALL of the changes in the commit only touch
content under $DIR?
B) Only keep commits where SOME of the changes in the commit touch
content under $DIR?

I suspected that it was behaving as A.

That's when I decided to run the commit-filter first in combination
with the tree-filter.  This would
leave me with all commits that touched the subdir but any commit that
touched multiple subdirs
would be cleaned up so it only touched the subdir I want to keep.

At this point I have a bunch of commits that only make changes to
subdir (verified using gitk), and I would
expect subdirectory-filter to keep every single commit.

However, after running it, I lose most of my commits.  Strangely, the
working tree is bit-for-bit correct
compared with the original version of the subdir in the old repo, but the
history leading up to it is not.

--subdirectory-filter does not seem to behave as either A or B above
but in some other way.  I'm sure
it will turn out to be something silly, but I'm pulling my hair out
trying to figure this one out.

Hopefully that's a clearer explanation!

-- 
James


* Re: git filter-branch --subdirectory-filter
  2008-05-09  7:38   ` James Sadler
@ 2008-05-09  7:57     ` Johannes Sixt
  2008-05-09  8:00     ` Jeff King
  1 sibling, 0 replies; 10+ messages in thread
From: Johannes Sixt @ 2008-05-09  7:57 UTC (permalink / raw)
  To: James Sadler; +Cc: Jeff King, git

James Sadler schrieb:
> Hi Jeff,
> 
> After reading your response and re-reading my original email, I
> realised it was totally unclear
> so I have re-explained myself below.
> 
> 2008/5/9 Jeff King <peff@peff.net>:
>> On Fri, May 09, 2008 at 11:01:47AM +1000, James Sadler wrote:
>>> and ran filter-branch with a --commit-filter to skip commits that were
>>> irrelevant to the subdir.
>> But that's part of what subdirectory-filter does, so this step is
>> unnecessary.
> 
> Yes that's true, but...
> 
> Clearer explanation:
> 
> I originally tried --subdirectory-filter by itself to see if it would
> do the job, but it filtered
> more commits than I thought it should (some commits that touched the subdir were
> missing after filter-branch was run).
> 
> I then began to question my understanding of the semantics of
> subdirectory-filter.
> 
> Is it meant to:
> A) Only keep commits where ALL of the changes in the commit only touch
> content under $DIR?
> B) Only keep commits where SOME of the changes in the commit touch
> content under $DIR?
> 
> I suspected that it was behaving as A.

It's expected to do B.

> That's when I decided to run the commit-filter first in combination
> with the tree-filter.  This would
> leave me with all commits that touched the subdir but any commit that
> touched multiple subdirs
> would be cleaned up so it only touched the subdir I want to keep.
> 
> At this point I have a bunch of commits that only make changes to
> subdir (verified using gitk), and I would
> expect subdirectory-filter to keep every single commit.

At this point you don't need subdirectory-filter. Use an --index-filter to
 keep only the subdirectory *and* remove the directory name at the same
time. Something like this:

git filter-branch --index-filter \
        'git ls-files -s thedir | sed "s-\tthedir/-\t-" |
                GIT_INDEX_FILE=$GIT_INDEX_FILE.new \
                        git update-index --index-info &&
         mv "$GIT_INDEX_FILE.new" "$GIT_INDEX_FILE"' HEAD

> However, after running it, I lose most of my commits.  Strangely, the
> working tree is bit-for-bit correct
> compared with the original version of the subdir in the old repo, but the
> history leading up to it is not.

The bit-for-bit correctness is not surprising, but the incorrect history
is. What is your definition of "correct" (i.e. can you give an example of
your expectations that are not met)? Do you have complicated history (with
merges)? Note that merges are removed if all but one of the merged
branches do not touch the subdirectory.
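
The merge behaviour can be sketched on a toy repository. This assumes
that a path-limited "git rev-list" is a reasonable stand-in for the set
of commits the subdirectory filter keeps. The branch name "side" and the
file contents are made up for illustration:

```shell
#!/bin/sh
set -e
# Throwaway identity so the demo commits work on any machine.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

cd "$(mktemp -d)"
git init -q
main=$(git symbolic-ref --short HEAD)   # master or main, depending on config

mkdir subdir1 subdir2
echo one >subdir1/file
echo two >subdir2/file
git add .
git commit -q -m initial

# Side branch: touches subdir2 only.
git checkout -q -b side
echo 'side change' >>subdir2/file
git commit -q -a -m 'side: subdir2 only'

# Main line: touches subdir1 only.
git checkout -q "$main"
echo 'main change' >>subdir1/file
git commit -q -a -m 'main: subdir1 only'

# A true merge (the branches have diverged, so no fast-forward).
git merge -q -m 'merge side' side

all=$(git rev-list --count HEAD)
kept=$(git rev-list --count HEAD -- subdir1)
echo "commits in history: $all"
echo "commits kept when limited to subdir1: $kept"
```

Only one side of the merge touches subdir1, so the merge commit and the
side-branch commit both drop out of the path-limited history. A repository
with many such merges will show far fewer commits after filtering than
gitk shows before it.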

-- Hannes


* Re: git filter-branch --subdirectory-filter
  2008-05-09  7:38   ` James Sadler
  2008-05-09  7:57     ` Johannes Sixt
@ 2008-05-09  8:00     ` Jeff King
  2008-05-10  3:31       ` James Sadler
  1 sibling, 1 reply; 10+ messages in thread
From: Jeff King @ 2008-05-09  8:00 UTC (permalink / raw)
  To: James Sadler; +Cc: git

On Fri, May 09, 2008 at 05:38:12PM +1000, James Sadler wrote:

> I originally tried --subdirectory-filter by itself to see if it would
> do the job, but it filtered more commits than I thought it should
> (some commits that touched the subdir were missing after filter-branch
> was run).
> 
> I then began to question my understanding of the semantics of
> subdirectory-filter.
> 
> Is it meant to:
> A) Only keep commits where ALL of the changes in the commit only touch
> content under $DIR?
> B) Only keep commits where SOME of the changes in the commit touch
> content under $DIR?
> 
> I suspected that it was behaving as A.

My understanding is that it should behave as B. E.g.:

  git init
  mkdir subdir1 subdir2
  echo content 1 >subdir1/file
  echo content 2 >subdir2/file
  git add .
  git commit -m initial
  echo changes 1 >>subdir1/file
  git commit -a -m 'only one'
  echo more changes 1 >>subdir1/file
  echo more changes 2 >>subdir2/file
  git commit -a -m 'both'
  git filter-branch --subdirectory-filter subdir1
  git log --name-status --pretty=oneline

should show something like:

  b119e21829b6039aa8fe938fb0304a9a7436b84d both
  M       file
  db2ad8e702f36a1df99dd529aa594e756010b191 only one
  M       file
  dacb4c2536e61c18079bcc73ea81fa0fb139c097 initial
  A       file

IOW, all commits touch subdir1/file, which becomes just 'file'.
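
One way to check a repository against this expectation is to count the
commits that touch the subdirectory before filtering and compare with the
number of commits left afterwards. The sketch below does that on a slightly
extended version of the toy history above (it adds one commit that only
touches subdir2). It assumes a reasonably recent git: "rev-list --count"
and the FILTER_BRANCH_SQUELCH_WARNING variable are not available in old
releases.

```shell
#!/bin/sh
set -e
# Throwaway identity so the demo commits work on any machine.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
# Newer git versions print a deprecation notice and pause before
# running filter-branch unless this is set.
export FILTER_BRANCH_SQUELCH_WARNING=1

cd "$(mktemp -d)"
git init -q
mkdir subdir1 subdir2
echo content 1 >subdir1/file
echo content 2 >subdir2/file
git add .
git commit -q -m initial
echo changes 1 >>subdir1/file
git commit -q -a -m 'only one'
echo changes 2 >>subdir2/file
git commit -q -a -m 'only two'
echo more changes 1 >>subdir1/file
echo more changes 2 >>subdir2/file
git commit -q -a -m both

# Commits that touch subdir1 at all (behaviour "B").
expected=$(git rev-list --count HEAD -- subdir1)

git filter-branch --subdirectory-filter subdir1 HEAD >/dev/null 2>&1

# Commits left on the rewritten branch, and the files at its tip.
actual=$(git rev-list --count HEAD)
files=$(git ls-files)
echo "expected $expected commits, got $actual; files at tip: $files"
```

If the two counts differ on a linear history, that would point at a bug.
On a history with merges, some difference is expected.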

It could be a bug in git-filter-branch. What version of git are you
using?

-Peff


* Re: git filter-branch --subdirectory-filter
  2008-05-09  8:00     ` Jeff King
@ 2008-05-10  3:31       ` James Sadler
  2008-05-10  5:53         ` Jeff King
  0 siblings, 1 reply; 10+ messages in thread
From: James Sadler @ 2008-05-10  3:31 UTC (permalink / raw)
  To: Jeff King; +Cc: git

2008/5/9 Jeff King <peff@peff.net>:
> On Fri, May 09, 2008 at 05:38:12PM +1000, James Sadler wrote:
>
>> I originally tried --subdirectory-filter by itself to see if it would
>> do the job, but it filtered more commits than I thought it should
>> (some commits that touched the subdir were missing after filter-branch
>> was run).
>>
>> I then began to question my understanding of the semantics of
>> subdirectory-filter.
>>
>> Is it meant to:
>> A) Only keep commits where ALL of the changes in the commit only touch
>> content under $DIR?
>> B) Only keep commits where SOME of the changes in the commit touch
>> content under $DIR?
>>
>> I suspected that it was behaving as A.
>
> My understanding is that it should behave as B. E.g.:
>
>  git init
>  mkdir subdir1 subdir2
>  echo content 1 >subdir1/file
>  echo content 2 >subdir2/file
>  git add .
>  git commit -m initial
>  echo changes 1 >>subdir1/file
>  git commit -a -m 'only one'
>  echo more changes 1 >>subdir1/file
>  echo more changes 2 >>subdir2/file
>  git commit -a -m 'both'
>  git filter-branch --subdirectory-filter subdir1
>  git log --name-status --pretty=oneline
>
> should show something like:
>
>  b119e21829b6039aa8fe938fb0304a9a7436b84d both
>  M       file
>  db2ad8e702f36a1df99dd529aa594e756010b191 only one
>  M       file
>  dacb4c2536e61c18079bcc73ea81fa0fb139c097 initial
>  A       file
>

Behaving as B is definitely the desired behaviour, but I am not observing that.
I'll see if I can create a test case to demonstrate.  Unfortunately,
I don't have the right to distribute our repo so will have to attempt
to reproduce the
problem another way.

Does anybody have a script that can take an existing repo,
and create a new one with garbled-but-equivalent commits?  i.e.  file
and directory structure
is same with names changed, and there is a one-one relationship
between lines of text
in new repo and old one except the lines have been scrambled?  It would be
a useful tool for distributing private repositories for debugging reasons.

> IOW, all commits touch subdir1/file, which becomes just 'file'.
>
> It could be a bug in git-filter-branch. What version of git are you
> using?

I am using git version 1.5.5

>
> -Peff
>

-- 
James


* Re: git filter-branch --subdirectory-filter
  2008-05-10  3:31       ` James Sadler
@ 2008-05-10  5:53         ` Jeff King
  2008-05-10  7:10           ` James Sadler
  2008-05-10 11:38           ` James Sadler
  0 siblings, 2 replies; 10+ messages in thread
From: Jeff King @ 2008-05-10  5:53 UTC (permalink / raw)
  To: James Sadler; +Cc: git

On Sat, May 10, 2008 at 01:31:37PM +1000, James Sadler wrote:

> Does anybody have a script that can take an existing repo, and create
> a new one with garbled-but-equivalent commits?  i.e.  file and
> directory structure is same with names changed, and there is a one-one
> relationship between lines of text in new repo and old one except the
> lines have been scrambled?  It would be a useful tool for distributing
> private repositories for debugging reasons.

This is only lightly tested, but the script below should do the trick.
It works as an index filter which munges all content in such a way that
a particular line is always given the same replacement text. That means
that diffs will look approximately the same, but will add and remove
lines that say "Fake line XXX" instead of the actual content.
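
The line-mapping idea can be tried on its own, without git. The sketch
below is an awk stand-in for illustration, not the Perl script that
follows, and the file names are made up. It replaces each distinct line
with a stable "Fake line N" and checks that the diff between two versions
keeps its shape:

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"

# Two versions of a hypothetical file.
printf 'alpha\nbeta\ngamma\n' >old.txt
printf 'alpha\nbeta changed\ngamma\ndelta\n' >new.txt

# Both files go through one awk run so they share one mapping:
# a line seen before always gets the same replacement text.
awk '
  !($0 in fake) { fake[$0] = "Fake line " (n++) }
  { print fake[$0] > (FILENAME ".fake") }
' old.txt new.txt

# Reduce a diff to its shape: hunk headers plus bare < and > markers.
shape() {
  diff "$1" "$2" | sed 's/^\([<>]\).*/\1/'
}

real_shape=$(shape old.txt new.txt)
fake_shape=$(shape old.txt.fake new.txt.fake)

cat new.txt.fake
```

Unchanged lines map to the same fake text in both versions, so only the
lines that really changed show up in the obfuscated diff.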

You can munge the commit messages themselves by just replacing them with
some unique text; in the example below, we just replace them with the
md5sum of the content.

This will leave the original author, committer, and date, which is
presumably non-proprietary.

-- >8 --
#!/usr/bin/perl
#
# Obscure a repository while still maintaining the same history
# structure and diffs.
#
# Invoke as:
#   git filter-branch \
#     --msg-filter md5sum \
#     --index-filter /path/to/this/script

use strict;
use IPC::Open2;
use DB_File;
use Fcntl;
tie my %blob_cache, 'DB_File', 'blob-cache', O_RDWR|O_CREAT, 0666;
tie my %line_cache, 'DB_File', 'line-cache', O_RDWR|O_CREAT, 0666;

open(my $lsfiles, '-|', qw(git ls-files --stage))
  or die "unable to open ls-files: $!";
open(my $update, '|-', qw(git update-index --index-info))
  or die "unable to open update-index: $!";

while(<$lsfiles>) {
  my ($mode, $hash, $path) = /^(\d+) ([0-9a-f]{40}) \d\t(.*)/
    or die "bad ls-files line: $_";
  $blob_cache{$hash} = munge($hash)
    unless exists $blob_cache{$hash};
  print $update "$mode $blob_cache{$hash}\t$path\n";
}

close($lsfiles);
close($update);
# $? holds a wait status (exit code << 8); passing it to exit as-is can wrap to 0.
exit($? ? 1 : 0);

sub munge {
  my $h = shift;

  open(my $in, '-|', qw(git show), $h)
    or die "unable to open git show: $!";
  open2(my $hash, my $out, qw(git hash-object -w --stdin));

  while(<$in>) {
    $line_cache{$_} ||= 'Fake line ' . $line_cache{CURRENT}++ . "\n";
    print $out $line_cache{$_};
  }

  close($in);
  close($out);

  my $r = <$hash>;
  chomp $r;
  return $r;
}


* Re: git filter-branch --subdirectory-filter
  2008-05-10  5:53         ` Jeff King
@ 2008-05-10  7:10           ` James Sadler
  2008-05-10 11:38           ` James Sadler
  1 sibling, 0 replies; 10+ messages in thread
From: James Sadler @ 2008-05-10  7:10 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Excellent!  I'll give that a whirl, thanks.

- James.

-- 
James


* Re: git filter-branch --subdirectory-filter
  2008-05-10  5:53         ` Jeff King
  2008-05-10  7:10           ` James Sadler
@ 2008-05-10 11:38           ` James Sadler
  2008-05-10 11:44             ` Jeff King
  1 sibling, 1 reply; 10+ messages in thread
From: James Sadler @ 2008-05-10 11:38 UTC (permalink / raw)
  To: Jeff King; +Cc: git

2008/5/10 Jeff King <peff@peff.net>:
> On Sat, May 10, 2008 at 01:31:37PM +1000, James Sadler wrote:
>
>> Does anybody have a script that can take an existing repo, and create
>> a new one with garbled-but-equivalent commits?  i.e.  file and
>> directory structure is same with names changed, and there is a one-one
>> relationship between lines of text in new repo and old one except the
>> lines have been scrambled?  It would be a useful tool for distributing
>> private repositories for debugging reasons.
>
> This is only lightly tested, but the script below should do the trick.
> It works as an index filter which munges all content in such a way that
> a particular line is always given the same replacement text. That means
> that diffs will look approximately the same, but will add and remove
> lines that say "Fake line XXX" instead of the actual content.
>
> You can munge the commit messages themselves by just replacing them with
> some unique text; in the example below, we just replace them with the
> md5sum of the content.
>
> This will leave the original author, committer, and date, which is
> presumably non-proprietary.
>

> <snip>

Jeff,

I have run your script on my repo and now have an obfuscated version.
When I run 'git filter-branch --subdirectory-filter $DIR' on this repo, the same
problem occurs, i.e. there are fewer commits remaining than I would expect.

If I place this repo somewhere you can download it, would you be kind enough
to take a look?  I'll detail the steps required to reproduce in another post.

Thanks,

James


* Re: git filter-branch --subdirectory-filter
  2008-05-10 11:38           ` James Sadler
@ 2008-05-10 11:44             ` Jeff King
  0 siblings, 0 replies; 10+ messages in thread
From: Jeff King @ 2008-05-10 11:44 UTC (permalink / raw)
  To: James Sadler; +Cc: git

On Sat, May 10, 2008 at 09:38:59PM +1000, James Sadler wrote:

> I have run your script on my repo and now have an obfuscated version.
> When I run 'git filter-branch --subdirectory-filter $DIR' on this repo,
> the same problem occurs, i.e. there are fewer commits remaining than I
> would expect.

Great, I'm glad the obfuscation worked.

> If I place this repo somewhere you can download it, would you be kind
> enough to take a look?  I'll detail the steps required to reproduce in
> another post.

Sure.

-Peff
