* git filter-branch --subdirectory-filter
From: James Sadler @ 2008-05-09 1:01 UTC
To: git

Hi All,

I have some issues with git filter-branch.

I have a git repository that I wish to split into multiple separate
repositories, one for each logical module that it contains. Each logical
module is already in its own directory at the root of the repo.

My experiments with 'git filter-branch' have been *partially* successful.

To extract a module into its own repo, I first copied the original
repo (this was a simple cp -r, as it seemed to be the simplest way as
git clone doesn't get all the branches) and ran filter-branch with a
--commit-filter to skip commits that were irrelevant to the subdir.
That step worked just fine.

The next pass was to 'hoist' the contents of the subdir in the new
repo into the root dir.

I thought I could do this with a --subdirectory-filter argument to
filter-branch, except when I do this, I lose tons of commits. (The
working tree is correct, i.e. the same as the original repo working
tree, but the history is screwed.)

Anybody have any idea what I am doing wrong? If it can't be done with
--subdirectory-filter, can it be done with the 'subtree' merge strategy
somehow?

Cheers,
--
James

* Re: git filter-branch --subdirectory-filter
From: Jeff King @ 2008-05-09 1:33 UTC
To: James Sadler; +Cc: git

On Fri, May 09, 2008 at 11:01:47AM +1000, James Sadler wrote:

> I have a git repository that I wish to split into multiple separate
> repositories for each logical module that it contains. Each logical
> module is already in its own directory at the root of the repo.

OK.

> To extract a module into its own repo, I first copied the original
> repo (this was a simple cp -r, as it seemed to be the simplest way as
> git clone doesn't get all the branches)

It does copy them, but they're just "remote tracking branches". If you
have many branches, you can recreate them via a loop with git-branch, or
by "git fetch . refs/remotes/origin/*:refs/heads/*". If you have only
one branch, you might just want to make a few copies of it with
"for i in repo1 repo2; do git branch $i master; done", and then
filter-branch those branches.

In either case, your cp is fine, if just a little less efficient.

> and ran filter-branch with a --commit-filter to skip commits that were
> irrelevant to the subdir.

But that's part of what subdirectory-filter does, so this step is
unnecessary.

> The next pass was to 'hoist' the contents of the subdir in the new
> repo into the root dir.

And that's the other part of what subdirectory-filter does.

> I thought I could do this with a --subdirectory-filter argument to
> filter-branch, except when I do this, I lose tons of commits. (The
> working tree is correct, i.e. the same as the original repo working
> tree, but the history is screwed).

You'll have to be more specific about what's wrong with history. Of
course some commits will be gone after filtering the subdir (those that
didn't touch anything in the subdir); that's part of the point.

-Peff

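The "loop with git-branch" mentioned above might look roughly like the sketch below, run inside a fresh clone. It assumes the remote is called `origin`; the function name `make_local_branches` is made up for illustration and is not part of git.

```shell
#!/bin/sh
# Create a local branch for every remote-tracking branch of "origin".
# Hypothetical helper; assumes the remote is named "origin".
make_local_branches() {
    git for-each-ref --format='%(refname)' refs/remotes/origin |
    while read -r ref; do
        name=${ref#refs/remotes/origin/}
        # origin/HEAD only points at the remote's default branch; skip it.
        [ "$name" = HEAD ] && continue
        # Leave alone any branch that already exists locally.
        git show-ref --verify --quiet "refs/heads/$name" && continue
        git branch --no-track "$name" "$ref"
    done
}
```

Calling `make_local_branches` once after `git clone` should leave a `refs/heads/*` entry for each branch of the original repository, which can then be passed to filter-branch.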
* Re: git filter-branch --subdirectory-filter
From: James Sadler @ 2008-05-09 7:38 UTC
To: Jeff King; +Cc: git

Hi Jeff,

After reading your response and re-reading my original email, I realised
it was totally unclear, so I have re-explained myself below.

2008/5/9 Jeff King <peff@peff.net>:
> On Fri, May 09, 2008 at 11:01:47AM +1000, James Sadler wrote:
>
>> I have a git repository that I wish to split into multiple separate
>> repositories for each logical module that it contains. Each logical
>> module is already in its own directory at the root of the repo.
>
> OK.
>
>> To extract a module into its own repo, I first copied the original
>> repo (this was a simple cp -r, as it seemed to be the simplest way as
>> git clone doesn't get all the branches)

I must have experienced a brain fart or something or missed the '-r'
from git branch...

>> and ran filter-branch with a --commit-filter to skip commits that were
>> irrelevant to the subdir.
>
> But that's part of what subdirectory-filter does, so this step is
> unnecessary.

Yes that's true, but...

Clearer explanation:

I originally tried --subdirectory-filter by itself to see if it would
do the job, but it filtered more commits than I thought it should
(some commits that touched the subdir were missing after filter-branch
was run).

I then began to question my understanding of the semantics of
subdirectory-filter.

Is it meant to:
A) Only keep commits where ALL of the changes in the commit only touch
   content under $DIR?
B) Only keep commits where SOME of the changes in the commit touch
   content under $DIR?

I suspected that it was behaving as A.

That's when I decided to run the commit-filter first in combination
with the tree-filter. This would leave me with all commits that touched
the subdir, but any commit that touched multiple subdirs would be
cleaned up so it only touched the subdir I want to keep.

At this point I have a bunch of commits that only make changes to
subdir (verified using gitk), and I would expect subdirectory-filter to
keep every single commit.

However, after running it, I lose most of my commits. Strangely, the
working tree is bit-for-bit correct with the original version of the
subdir in the old repo, but the history leading up to it is not.

--subdirectory-filter does not seem to behave as either A or B above,
but in some other way.

I'm sure it will turn out to be something silly, but I'm pulling my hair
out trying to figure this one out.

Hopefully that's a clearer explanation!

--
James

* Re: git filter-branch --subdirectory-filter
From: Johannes Sixt @ 2008-05-09 7:57 UTC
To: James Sadler; +Cc: Jeff King, git

James Sadler wrote:
> Hi Jeff,
>
> After reading your response and re-reading my original email, I
> realised it was totally unclear
> so I have re-explained myself below.
>
> 2008/5/9 Jeff King <peff@peff.net>:
>> On Fri, May 09, 2008 at 11:01:47AM +1000, James Sadler wrote:
>>> and ran filter-branch with a --commit-filter to skip commits that were
>>> irrelevant to the subdir.
>> But that's part of what subdirectory-filter does, so this step is
>> unnecessary.
>
> Yes that's true, but...
>
> Clearer explanation:
>
> I originally tried --subdirectory-filter by itself to see if it would
> do the job, but it filtered
> more commits than I thought it should (some commits that touched the
> subdir were missing after filter-branch was run).
>
> I then began to question my understanding of the semantics of
> subdirectory-filter.
>
> Is it meant to:
> A) Only keep commits where ALL of the changes in the commit only touch
> content under $DIR?
> B) Only keep commits where SOME of the changes in the commit touch
> content under $DIR?
>
> I suspected that it was behaving as A.

It's expected to do B.

> That's when I decided to run the commit-filter first in combination
> with the tree-filter. This would
> leave me with all commits that touched the subdir but any commit that
> touched multiple subdirs
> would be cleaned up so it only touched the subdir I want to keep.
>
> At this point I have a bunch of commits that only make changes to
> subdir (verified using gitk), and I would
> expect subdirectory-filter to keep every single commit.

At this point you don't need subdirectory-filter. Use an --index-filter
to keep only the subdirectory *and* remove the directory name at the
same time. Something like this:

    git filter-branch --index-filter \
        'git ls-files -s thedir | sed "s-\tthedir/-\t-" |
            GIT_INDEX_FILE=$GIT_INDEX_FILE.new \
                git update-index --index-info &&
         mv "$GIT_INDEX_FILE.new" "$GIT_INDEX_FILE"' HEAD

> However, after running it, I lose most of my commits. Strangely, the
> working tree is bit-for-bit correct
> with the original version of the subdir in the old repo, but the
> history leading up to it is not.

The bit-for-bit correctness is not surprising, but the incorrect history
is. What is your definition of "correct" (i.e. can you give an example
of your expectations that are not met)? Do you have complicated history
(with merges)? Note that merges are removed if all but one of the merged
branches do not touch the subdirectory.

-- Hannes

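The path-rewriting step of the index filter above can be tried on its own, without touching a repository. The sketch below is only an illustration: `strip_dir_prefix` is a made-up name, the sample index entry is invented, and it uses a literal tab instead of `\t` so it does not depend on GNU sed. The point it shows is that the tab between the metadata and the path has to survive the substitution, because `git update-index --index-info` expects lines of the form `<mode> <sha1> <stage>TAB<path>`.

```shell
#!/bin/sh
# Drop a leading directory from the path field of "git ls-files -s" lines.
# Hypothetical helper; assumes the directory name contains no "|" and no
# regex metacharacters, and that paths are not quoted by git.
strip_dir_prefix() {
    dir=$1
    tab=$(printf '\t')
    # Match TAB + dir + "/" and put the TAB back, so each line keeps the
    # "<mode> <sha1> <stage>TAB<path>" layout.
    sed "s|${tab}${dir}/|${tab}|"
}

# Example with an invented index entry:
printf '100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0\tthedir/src/main.c\n' |
    strip_dir_prefix thedir
```

The example line comes out with the path `src/main.c` and the tab still in place.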
* Re: git filter-branch --subdirectory-filter
From: Jeff King @ 2008-05-09 8:00 UTC
To: James Sadler; +Cc: git

On Fri, May 09, 2008 at 05:38:12PM +1000, James Sadler wrote:

> I originally tried --subdirectory-filter by itself to see if it would
> do the job, but it filtered more commits than I thought it should
> (some commits that touched the subdir were missing after filter-branch
> was run).
>
> I then began to question my understanding of the semantics of
> subdirectory-filter.
>
> Is it meant to:
> A) Only keep commits where ALL of the changes in the commit only touch
> content under $DIR?
> B) Only keep commits where SOME of the changes in the commit touch
> content under $DIR?
>
> I suspected that it was behaving as A.

My understanding is that it should behave as B. E.g.:

    git init
    mkdir subdir1 subdir2
    echo content 1 >subdir1/file
    echo content 2 >subdir2/file
    git add .
    git commit -m initial
    echo changes 1 >>subdir1/file
    git commit -a -m 'only one'
    echo more changes 1 >>subdir1/file
    echo more changes 2 >>subdir2/file
    git commit -a -m 'both'
    git filter-branch --subdirectory-filter subdir1
    git log --name-status --pretty=oneline

should show something like:

    b119e21829b6039aa8fe938fb0304a9a7436b84d both
    M       file
    db2ad8e702f36a1df99dd529aa594e756010b191 only one
    M       file
    dacb4c2536e61c18079bcc73ea81fa0fb139c097 initial
    A       file

IOW, all commits touch subdir1/file, which becomes just 'file'.

It could be a bug in git-filter-branch. What version of git are you
using?

-Peff

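One way to put a number on "fewer commits than expected" is to count, before filtering, the commits that change anything under the subdirectory. The sketch below is a rough cross-check only: the function names are made up, and it assumes that path-limited `git rev-list` is a fair approximation of what the subdirectory filter keeps, which this thread does not confirm.

```shell
#!/bin/sh
# Count commits reachable from HEAD that change anything under a path,
# using git's default history simplification. Hypothetical helper.
commits_touching() {
    git rev-list --count HEAD -- "$1"
}

# Same count, but with --full-history, so side branches and merges are
# not pruned just because one parent already has the same content.
commits_touching_full() {
    git rev-list --count --full-history HEAD -- "$1"
}
```

In the three-commit example above, `commits_touching subdir1` gives 3 and `commits_touching subdir2` gives 2. If the two functions disagree for a directory in a real repository, the difference comes from merges and side branches, which matches the earlier remark that merges can be removed.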
* Re: git filter-branch --subdirectory-filter
From: James Sadler @ 2008-05-10 3:31 UTC
To: Jeff King; +Cc: git

2008/5/9 Jeff King <peff@peff.net>:
> On Fri, May 09, 2008 at 05:38:12PM +1000, James Sadler wrote:
>
>> I originally tried --subdirectory-filter by itself to see if it would
>> do the job, but it filtered more commits than I thought it should
>> (some commits that touched the subdir were missing after filter-branch
>> was run).
>>
>> I then began to question my understanding of the semantics of
>> subdirectory-filter.
>>
>> Is it meant to:
>> A) Only keep commits where ALL of the changes in the commit only touch
>> content under $DIR?
>> B) Only keep commits where SOME of the changes in the commit touch
>> content under $DIR?
>>
>> I suspected that it was behaving as A.
>
> My understanding is that it should behave as B. E.g.:
>
>     git init
>     mkdir subdir1 subdir2
>     echo content 1 >subdir1/file
>     echo content 2 >subdir2/file
>     git add .
>     git commit -m initial
>     echo changes 1 >>subdir1/file
>     git commit -a -m 'only one'
>     echo more changes 1 >>subdir1/file
>     echo more changes 2 >>subdir2/file
>     git commit -a -m 'both'
>     git filter-branch --subdirectory-filter subdir1
>     git log --name-status --pretty=oneline
>
> should show something like:
>
>     b119e21829b6039aa8fe938fb0304a9a7436b84d both
>     M       file
>     db2ad8e702f36a1df99dd529aa594e756010b191 only one
>     M       file
>     dacb4c2536e61c18079bcc73ea81fa0fb139c097 initial
>     A       file
>

Behaving as B is definitely the desired behaviour, but I am not
observing that.

I'll see if I can create a test case to demonstrate. Unfortunately, I
don't have the right to distribute our repo, so I will have to attempt
to reproduce the problem another way.

Does anybody have a script that can take an existing repo, and create
a new one with garbled-but-equivalent commits? i.e. file and directory
structure is the same with names changed, and there is a one-to-one
relationship between lines of text in the new repo and the old one,
except the lines have been scrambled? It would be a useful tool for
distributing private repositories for debugging reasons.

> IOW, all commits touch subdir1/file, which becomes just 'file'.
>
> It could be a bug in git-filter-branch. What version of git are you
> using?

I am using git version 1.5.5

>
> -Peff
>

--
James

* Re: git filter-branch --subdirectory-filter
From: Jeff King @ 2008-05-10 5:53 UTC
To: James Sadler; +Cc: git

On Sat, May 10, 2008 at 01:31:37PM +1000, James Sadler wrote:

> Does anybody have a script that can take an existing repo, and create
> a new one with garbled-but-equivalent commits? i.e. file and
> directory structure is the same with names changed, and there is a
> one-to-one relationship between lines of text in the new repo and the
> old one, except the lines have been scrambled? It would be a useful
> tool for distributing private repositories for debugging reasons.

This is only lightly tested, but the script below should do the trick.
It works as an index filter which munges all content in such a way that
a particular line is always given the same replacement text. That means
that diffs will look approximately the same, but will add and remove
lines that say "Fake line XXX" instead of the actual content.

You can munge the commit messages themselves by just replacing them with
some unique text; in the example below, we just replace them with the
md5sum of the content.

This will leave the original author, committer, and date, which is
presumably non-proprietary.

-- >8 --
#!/usr/bin/perl
#
# Obscure a repository while still maintaining the same history
# structure and diffs.
#
# Invoke as:
#   git filter-branch \
#     --msg-filter md5sum \
#     --index-filter /path/to/this/script

use strict;
use IPC::Open2;
use DB_File;
use Fcntl;

# On-disk caches: original blob hash -> munged blob hash, and
# original line -> replacement line.
tie my %blob_cache, 'DB_File', 'blob-cache', O_RDWR|O_CREAT, 0666;
tie my %line_cache, 'DB_File', 'line-cache', O_RDWR|O_CREAT, 0666;

open(my $lsfiles, '-|', qw(git ls-files --stage))
  or die "unable to open ls-files: $!";
open(my $update, '|-', qw(git update-index --index-info))
  or die "unable to open update-index: $!";

# Re-point every index entry at the munged version of its blob.
while(<$lsfiles>) {
  my ($mode, $hash, $path) = /^(\d+) ([0-9a-f]{40}) \d\t(.*)/
    or die "bad ls-files line: $_";
  $blob_cache{$hash} = munge($hash)
    unless exists $blob_cache{$hash};
  print $update "$mode $blob_cache{$hash}\t$path\n";
}

close($lsfiles);
close($update);
exit $?;

# Write a new blob in which each line is replaced by its "Fake line N"
# stand-in, and return the new blob's hash.
sub munge {
  my $h = shift;

  open(my $in, '-|', qw(git show), $h)
    or die "unable to open git show: $!";
  open2(my $hash, my $out, qw(git hash-object -w --stdin));

  while(<$in>) {
    $line_cache{$_} ||= 'Fake line ' . $line_cache{CURRENT}++ . "\n";
    print $out $line_cache{$_};
  }

  close($in);
  close($out);

  my $r = <$hash>;
  chomp $r;
  return $r;
}

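The core idea of the script, that a given line always maps to the same placeholder, can be seen in isolation with a few lines of shell and awk. This is only a sketch of the principle: `obscure_lines` is a made-up name, and unlike the Perl script it keeps its mapping in memory for a single invocation instead of an on-disk cache shared across commits.

```shell
#!/bin/sh
# Replace every distinct input line with a stable placeholder, so that
# equal lines stay equal and a diff keeps its shape. Hypothetical helper;
# the mapping lasts for one invocation only.
obscure_lines() {
    awk 'BEGIN { n = 0 }
         !($0 in seen) { seen[$0] = "Fake line " (n++) }
         { print seen[$0] }'
}

# Example: the first and third lines are identical, so they get the
# same placeholder.
printf 'int x;\nint y;\nint x;\n' | obscure_lines
```

Because repeated lines get the same stand-in, a diff between two obscured versions of a file adds and removes lines in the same positions as a diff between the originals.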
* Re: git filter-branch --subdirectory-filter
From: James Sadler @ 2008-05-10 7:10 UTC
To: Jeff King; +Cc: git

Excellent! I'll give that a whirl, thanks.

- James.

2008/5/10 Jeff King <peff@peff.net>:
> On Sat, May 10, 2008 at 01:31:37PM +1000, James Sadler wrote:
>
>> Does anybody have a script that can take an existing repo, and create
>> a new one with garbled-but-equivalent commits? i.e. file and
>> directory structure is the same with names changed, and there is a
>> one-to-one relationship between lines of text in the new repo and the
>> old one, except the lines have been scrambled? It would be a useful
>> tool for distributing private repositories for debugging reasons.
>
> This is only lightly tested, but the script below should do the trick.
> It works as an index filter which munges all content in such a way that
> a particular line is always given the same replacement text. That means
> that diffs will look approximately the same, but will add and remove
> lines that say "Fake line XXX" instead of the actual content.
>
> You can munge the commit messages themselves by just replacing them with
> some unique text; in the example below, we just replace them with the
> md5sum of the content.
>
> This will leave the original author, committer, and date, which is
> presumably non-proprietary.
>
> <snip>

--
James

* Re: git filter-branch --subdirectory-filter
From: James Sadler @ 2008-05-10 11:38 UTC
To: Jeff King; +Cc: git

2008/5/10 Jeff King <peff@peff.net>:
> On Sat, May 10, 2008 at 01:31:37PM +1000, James Sadler wrote:
>
>> Does anybody have a script that can take an existing repo, and create
>> a new one with garbled-but-equivalent commits? i.e. file and
>> directory structure is the same with names changed, and there is a
>> one-to-one relationship between lines of text in the new repo and the
>> old one, except the lines have been scrambled? It would be a useful
>> tool for distributing private repositories for debugging reasons.
>
> This is only lightly tested, but the script below should do the trick.
> It works as an index filter which munges all content in such a way that
> a particular line is always given the same replacement text. That means
> that diffs will look approximately the same, but will add and remove
> lines that say "Fake line XXX" instead of the actual content.
>
> You can munge the commit messages themselves by just replacing them with
> some unique text; in the example below, we just replace them with the
> md5sum of the content.
>
> This will leave the original author, committer, and date, which is
> presumably non-proprietary.
>
> <snip>

Jeff,

I have run your script on my repo and now have an obfuscated version.
When I run 'git filter-branch --subdirectory-filter $DIR' on this repo,
the same problem occurs, i.e. there are fewer commits remaining than I
would expect.

If I place this repo somewhere you can download it, would you be kind
enough to take a look? I'll detail the steps required to reproduce in
another post.

Thanks,
James

* Re: git filter-branch --subdirectory-filter
From: Jeff King @ 2008-05-10 11:44 UTC
To: James Sadler; +Cc: git

On Sat, May 10, 2008 at 09:38:59PM +1000, James Sadler wrote:

> I have run your script on my repo and now have an obfuscated version.
> When I run 'git filter-branch --subdirectory-filter $DIR' on this repo,
> the same problem occurs, i.e. there are fewer commits remaining than I
> would expect.

Great, I'm glad the obfuscation worked.

> If I place this repo somewhere you can download it, would you be kind
> enough to take a look? I'll detail the steps required to reproduce in
> another post.

Sure.

-Peff