* git subtree as a solution to partial cloning?
@ 2009-05-25 7:35 Asger Ottar Alstrup
0 siblings, 0 replies; 9+ messages in thread
From: Asger Ottar Alstrup @ 2009-05-25 7:35 UTC (permalink / raw)
To: git; +Cc: Avery Pennarun, Alexander Gavrilov
I am considering different ways to get git to handle repositories with
very big files in a setup where partial clone is required, and it
seems git subtree might be a part of the solution.
Does git subtree support splitting at the file level, or only at
directory level? Also, how are conflicts handled when you subtree
merge changes back to the master? For this to work in practice, I
suppose the users of the split repositories should see the conflicts
and fix them themselves. Can the reduced split repositories reuse pack
files from the original repository? Can you think of any other
limitations to git subtree that would prevent it from working with big
files to support a partial cloning setup?
The alternative seems to be git sparse checkout extended with a
narrow clone, which does not exist yet, but it seems that a git
subtree based approach might be simpler.
Regards,
Asger Ottar Alstrup
* Re: git subtree as a solution to partial cloning?
[not found] <8873ae500905250021p20e7096dwf5bc71c36c4047b@mail.gmail.com>
@ 2009-05-25 7:59 ` Avery Pennarun
2009-05-25 9:33 ` Asger Ottar Alstrup
0 siblings, 1 reply; 9+ messages in thread
From: Avery Pennarun @ 2009-05-25 7:59 UTC (permalink / raw)
To: Asger Ottar Alstrup; +Cc: git, Alexander Gavrilov
On Mon, May 25, 2009 at 3:21 AM, Asger Ottar Alstrup <asger@area9.dk> wrote:
> I am considering different ways to get git to handle repositories with very
> big files in a setup where partial clone is required, and it seems git
> subtree might be a part of the solution.
Well, that wasn't really what it was originally made for... but perhaps.
> Does git subtree support splitting at the file level, or only at directory
> level?
Currently only at the directory level. In theory, there's nothing
stopping us from working with any subset of files... but it's really
much simpler this way (both to code and to explain) so I'd much rather
leave it as is. Can you reorganize your tree so that you divide the
needed files into different subdirectories?
> Also, how are conflicts handled when you subtree merge changes back
> to the master?
'git subtree split' generates a new commit history on top of the *most
recently merged* commit from the subproject. To merge back into the
subproject, you would take that newly-generated commit and do the
usual "git merge". (I.e., you'll have to check out the subproject
branch and merge it as usual.)
Alternatively, you could 'git subtree pull' the subproject first,
resolve the conflicts there, and 'git subtree split' after that; in
such a case, the newly-generated commit would be a fast-forward from
the original subproject's HEAD, so it would be okay to push right away
without switching branches first.
(Someone else suggested that we add a 'git subtree push' command to
make the split-then-push sequence nice and obvious; I think that's a
good idea and pretty easy.)
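As a rough illustration of the split step described above, the sketch
below builds a throwaway superproject and splits out one directory. The
repository layout, the file names and the branch name docs-only are
invented for the example; it assumes git-subtree is installed alongside
git.

```shell
#!/bin/sh
# Sketch only: layout, file names and branch name are hypothetical.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"
git init -q super
cd super
mkdir docs src
echo "manual v1" > docs/manual.txt
echo "int main(void) { return 0; }" > src/main.c
git add .
git commit -q -m "initial import"
echo "manual v2" > docs/manual.txt
git commit -q -am "update manual"

# Synthesize a history that contains only docs/, with docs/ as the root.
git subtree split --prefix=docs -b docs-only >/dev/null 2>&1

# The split branch sees manual.txt at the top level and nothing from src/.
split_files=$(git ls-tree -r --name-only docs-only)
split_commits=$(git rev-list --count docs-only)
echo "files on docs-only:   $split_files"
echo "commits on docs-only: $split_commits"
```

The resulting branch can then be merged or pushed into the subproject
repository like any other branch.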
> Can the
> reduced split repositories reuse pack files from the original repository?
Yes, all the tree and blob objects are identical between the two
repositories (except that the superproject has more of them, of
course).
Have fun,
Avery
* Re: git subtree as a solution to partial cloning?
2009-05-25 7:59 ` git subtree as a solution to partial cloning? Avery Pennarun
@ 2009-05-25 9:33 ` Asger Ottar Alstrup
2009-05-25 15:50 ` Avery Pennarun
0 siblings, 1 reply; 9+ messages in thread
From: Asger Ottar Alstrup @ 2009-05-25 9:33 UTC (permalink / raw)
To: Avery Pennarun; +Cc: git, Alexander Gavrilov
On Mon, May 25, 2009 at 9:59 AM, Avery Pennarun <apenwarr@gmail.com> wrote:
> On Mon, May 25, 2009 at 3:21 AM, Asger Ottar Alstrup <asger@area9.dk> wrote:
>> Does git subtree support splitting at the file level, or only at directory
>> level?
>
> Currently only at the directory level. In theory, there's nothing
> stopping us from working with any subset of files... but it's really
> much simpler this way (both to code and to explain) so I'd much rather
> leave it as is. Can you reorganize your tree so that you divide the
> needed files into different subdirectories?
No, that is unfortunately not so easy. If we could, I suppose we could
use submodules instead.
Are the subtree split and merge operations efficient? That is, how do
they scale with the size of the original and reduced repositories? And
is it feasible to use hooks to automate the splitting and merging
whenever there are changes in the original or reduced repositories?
Regards,
Asger
* Re: git subtree as a solution to partial cloning?
2009-05-25 9:33 ` Asger Ottar Alstrup
@ 2009-05-25 15:50 ` Avery Pennarun
2009-05-25 17:35 ` Asger Ottar Alstrup
0 siblings, 1 reply; 9+ messages in thread
From: Avery Pennarun @ 2009-05-25 15:50 UTC (permalink / raw)
To: Asger Ottar Alstrup; +Cc: git, Alexander Gavrilov
On Mon, May 25, 2009 at 5:33 AM, Asger Ottar Alstrup <asger@area9.dk> wrote:
> On Mon, May 25, 2009 at 9:59 AM, Avery Pennarun <apenwarr@gmail.com> wrote:
>> On Mon, May 25, 2009 at 3:21 AM, Asger Ottar Alstrup <asger@area9.dk> wrote:
>>> Does git subtree support splitting at the file level, or only at directory
>>> level?
>>
>> Currently only at the directory level. In theory, there's nothing
>> stopping us from working with any subset of files... but it's really
>> much simpler this way (both to code and to explain) so I'd much rather
>> leave it as is. Can you reorganize your tree so that you divide the
>> needed files into different subdirectories?
>
> No, that is unfortunately not so easy. If we could, I suppose we could
> use submodules instead.
Your only option may be to use git filter-branch then. It lets you do
pretty much anything you want, although merging it back together again
could be entertaining. (Making it correctly mergeable is by far the
trickiest part of git-subtree.)
> Are the subtree split and merge operations efficient? That is, how do
> they scale with the size of the original and reduced repositories? And
> is it feasible to use hooks to automate the splitting and merging
> whenever there are changes in the original or reduced repositories?
git subtree manipulates only commit objects (and a reference to the
single tree object representing the subtree in each commit) so it's
very fast and doesn't depend on file sizes or number of files.
Basically git subtree split is O(n) in the number of *commits* since
the most recent split.
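The reason no file data is touched can be sketched with plain plumbing.
This is not how git-subtree is implemented; it only mimics the idea of
wrapping an existing tree object in a new commit. The repository layout
and the branch name docs-only are assumptions made for the example.

```shell
#!/bin/sh
# Sketch only: mimics the idea with plumbing, not git-subtree's own code.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"
git init -q super
cd super
mkdir docs images
echo "manual" > docs/manual.txt
echo "pretend this is a large image" > images/big.bin
git add .
git commit -q -m "initial import"

# The tree object for docs/ already exists inside the superproject commit.
docs_tree=$(git rev-parse HEAD:docs)

# Wrap that existing tree in a new commit; no blob or tree is rewritten.
split_commit=$(git commit-tree "$docs_tree" -m "docs only")
git branch docs-only "$split_commit"

split_tree=$(git rev-parse 'docs-only^{tree}')
orig_blob=$(git rev-parse HEAD:docs/manual.txt)
split_blob=$(git rev-parse docs-only:manual.txt)
echo "tree in superproject: $docs_tree"
echo "tree on split branch: $split_tree"
```

Both the tree and the blob keep their object ids, which is also why the
two histories can share the same packed objects.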
Have fun,
Avery
* Re: git subtree as a solution to partial cloning?
2009-05-25 15:50 ` Avery Pennarun
@ 2009-05-25 17:35 ` Asger Ottar Alstrup
2009-05-25 17:54 ` Avery Pennarun
0 siblings, 1 reply; 9+ messages in thread
From: Asger Ottar Alstrup @ 2009-05-25 17:35 UTC (permalink / raw)
To: Avery Pennarun; +Cc: git, Alexander Gavrilov
On Mon, May 25, 2009 at 5:50 PM, Avery Pennarun <apenwarr@gmail.com> wrote:
> On Mon, May 25, 2009 at 5:33 AM, Asger Ottar Alstrup <asger@area9.dk> wrote:
>> No, that is unfortunately not so easy. If we could, I suppose we could
>> use submodules instead.
>
> Your only option may be to use git filter-branch then. It lets you do
> pretty much anything you want, although merging it back together again
> could be entertaining. (Making it correctly mergeable is by far the
> trickiest part of git-subtree.)
OK, so git subtree is not usable as it is for this. Instead, it seems
a new system has to be developed which would be similar to git subtree
in spirit, except that it would work at the file level. Of course, the
git merge subtree strategy cannot be used, so merging has to be done
differently.
So a poor man's system could work like this:
- A reduced repository is defined by a list of paths in a file, I
guess with a format similar to .gitignore
- To extract: A copy of the original repository is made. This copy is
reduced using git filter-branch. Is there some way of turning a
.gitignore syntax file into a concrete list of files? Also, can this
entire step be done in one step without the copy? Having to copy the
entire project first seems excessive. Will filter-branch preserve
and/or prune pack files intelligently?
- To merge from the reduced to the original: The very simple version
is just to copy all the files from the reduced repository into a
checkout of the original repository, and then merge. This would not
support removal (or renaming) of files, but that might be ok in my
setup. If this needs to be more intelligent, the list of files in the
reduced repository could be compared with the list of paths that was
used to reduce it originally. This can be used to detect removals and
additions of files.
- To merge from the original to the reduced: First merge the other
way, and then extract again.
I am new to git, so please excuse me if this design is mentally unsound.
Regards,
Asger
* Re: git subtree as a solution to partial cloning?
2009-05-25 17:35 ` Asger Ottar Alstrup
@ 2009-05-25 17:54 ` Avery Pennarun
2009-05-25 18:28 ` Asger Ottar Alstrup
0 siblings, 1 reply; 9+ messages in thread
From: Avery Pennarun @ 2009-05-25 17:54 UTC (permalink / raw)
To: Asger Ottar Alstrup; +Cc: git, Alexander Gavrilov
On Mon, May 25, 2009 at 1:35 PM, Asger Ottar Alstrup <asger@area9.dk> wrote:
> On Mon, May 25, 2009 at 5:50 PM, Avery Pennarun <apenwarr@gmail.com> wrote:
>> On Mon, May 25, 2009 at 5:33 AM, Asger Ottar Alstrup <asger@area9.dk> wrote:
>>> No, that is unfortunately not so easy. If we could, I suppose we could
>>> use submodules instead.
>>
>> Your only option may be to use git filter-branch then. It lets you do
>> pretty much anything you want, although merging it back together again
>> could be entertaining. (Making it correctly mergeable is by far the
>> trickiest part of git-subtree.)
>
> OK, so git subtree is not usable as it is for this. Instead, it seems
> a new system has to be developed which would be similar to git subtree
> in spirit, except that it would work at the file level. Of course, the
> git merge subtree strategy cannot be used, so merging has to be done
> differently.
That sounds about right.
> So a poor man's system could work like this:
>
> - A reduced repository is defined by a list of paths in a file, I
> guess with a format similar to .gitignore
Are you sure you want to define the list with exclusions instead of
inclusions? I don't really know your use case.
Anyway, if you're using git filter-branch, it'll be up to you to fix
the index to contain the list of files you want. (See man
git-filter-branch)
> - To extract: A copy of the original repository is made. This copy is
> reduced using git filter-branch. Is there some way of turning a
> .gitignore syntax file into a concrete list of files? Also, can this
> entire step be done in one step without the copy? Having to copy the
> entire project first seems excessive. Will filter-branch preserve
> and/or prune pack files intelligently?
You probably need to read about the differences between git trees,
blobs, and commits. You're not actually "copying" anything; you're
just creating some new directory structures that contain the
*existing* blobs. And of course the existing blobs are in your
existing packs.
This is a pretty good introduction:
http://eagain.net/articles/git-for-computer-scientists/
> - To merge from the reduced to the original: The very simple version
> is just to copy all the files from the reduced repository into a
> checkout of the original repository, and then merge. This would not
> support removal (or renaming) of files, but that might be ok in my
> setup. If this needs to be more intelligent, the list of files in the
> reduced repository could be compared with the list of paths that was
> used to reduce it originally. This can be used to detect removals and
> additions of files.
Yes. In the slightly fancier version of this, you could just do all
your merges from subset->main and never from main->subset, and then a
simple "git merge subset" would handle the above comparison,
additions, and removals for you.
> - To merge from the original to the reduced: First merge the other
> way, and then extract again.
Yes.
> I am new to git, so please excuse me if this design is mentally unsound.
Well, you're getting pretty far out there:
- git subtree is already an experimental tool that hasn't been
accepted by most people;
- you're doing something similar to git subtree, but even more complicated;
- git is known to work badly with large files, and you have a bunch of
large files;
- git is intended to manage entire repositories at a time, and you
want a partial checkout;
- git is intended to download the entire history at once, and you (I
think) only want part of it.
By the time you're this far out, maybe what you want isn't git at all.
svn would work fine with this arrangement, and people who want
partial checkouts would rarely benefit from git's distributedness
anyway, I expect.
Have fun,
Avery
* Re: git subtree as a solution to partial cloning?
2009-05-25 17:54 ` Avery Pennarun
@ 2009-05-25 18:28 ` Asger Ottar Alstrup
2009-05-25 19:18 ` Avery Pennarun
2009-05-25 23:26 ` Jakub Narebski
0 siblings, 2 replies; 9+ messages in thread
From: Asger Ottar Alstrup @ 2009-05-25 18:28 UTC (permalink / raw)
To: Avery Pennarun; +Cc: git, Alexander Gavrilov
On Mon, May 25, 2009 at 7:54 PM, Avery Pennarun <apenwarr@gmail.com> wrote:
> On Mon, May 25, 2009 at 1:35 PM, Asger Ottar Alstrup <asger@area9.dk> wrote:
>> So a poor man's system could work like this:
>>
>> - A reduced repository is defined by a list of paths in a file, I
>> guess with a format similar to .gitignore
>
> Are you sure you want to define the list with exclusions instead of
> inclusions? I don't really know your use case.
Since the .gitignore format supports !, I believe that should not make
much of a difference.
> Anyway, if you're using git filter-branch, it'll be up to you to fix
> the index to contain the list of files you want. (See man
> git-filter-branch)
Yes, sure, and that is why I asked whether there is some tool in git
that can give a list of concrete files surviving a .gitignore list of
patterns.
>> - To extract: A copy of the original repository is made. This copy is
>> reduced using git filter-branch. Is there some way of turning a
>> .gitignore syntax file into a concrete list of files? Also, can this
>> entire step be done in one step without the copy? Having to copy the
>> entire project first seems excessive. Will filter-branch preserve
>> and/or prune pack files intelligently?
>
> You probably need to read about the differences between git trees,
> blobs, and commits. You're not actually "copying" anything; you're
> just creating some new directory structures that contain the
> *existing* blobs. And of course the existing blobs are in your
> existing packs.
Thanks. OK, I see now that filter-branch will not destroy the original
repository. That is not at all obvious from reading the man page, when
the very first sentence says that it will rewrite history. But the
main point of this exercise is to shrink the reduced repository so
that it can be transferred efficiently. So after filter-branch, I
guess I would run clone to make the new, smaller repository, and then
the question becomes: Will clone reuse and prune packs intelligently?
> Well, you're getting pretty far out there:
>
> - git is known to work badly with large files, and you have a bunch of
> large files;
As far as I know, git has most of the hooks needed to tune this. There
are still some weak areas where big files are read into memory
multiple times, but I have seen that people are already working on
this.
> - git is intended to manage entire repositories at a time, and you
> want a partial checkout;
The beauty of the subtree-inspired approach is of course that the
users of the reduced repositories WILL in fact be working on an entire
repository. The files are luckily fairly independent in THEIR
workflow. Also, if the mirror-sync proposal gets implemented, one
important part of the distribution piece is also solved: In effect,
these systems combined would give us a kind of narrow-clone.
> - git is intended to download the entire history at once, and you (I
> think) only want part of it.
I do need the entire history for the reduced files.
> By the time you're this far out, maybe what you want isn't git at all.
> svn would work fine with this arrangement, and people who want
> partial checkouts would rarely benefit from git's distributedness
> anyway, I expect.
In my use case, some people will need to work on the full repository,
and they obviously will have the network and the machines to handle
this. I am currently thinking these people would use something like
glusterfs until mirrorsync is able to solve the problem for us.
However, there is a large group of users that do not need this, but
they DO need the entire history of the files they are interested in.
Subversion does not provide this. Subversion is also simply too slow
to handle the kind of files we need to work with. In addition, we have
run tests on the kind of files we have, and the delta compression that
git uses is very effective at compressing the PDF and OpenOffice
documents we use. The big files we have are primarily image files, and
obviously they do not compress very well. Fortunately, they do not
change much either.
While git might not currently be designed to support this use case, it
still seems like the best system to base this on. Yes, it will need
some work before we can use it for our needs, but it seems it is still
less work than what is needed to get other systems to support our
needs.
I appreciate your comments. They are very helpful.
Regards,
Asger
* Re: git subtree as a solution to partial cloning?
2009-05-25 18:28 ` Asger Ottar Alstrup
@ 2009-05-25 19:18 ` Avery Pennarun
2009-05-25 23:26 ` Jakub Narebski
1 sibling, 0 replies; 9+ messages in thread
From: Avery Pennarun @ 2009-05-25 19:18 UTC (permalink / raw)
To: Asger Ottar Alstrup; +Cc: git, Alexander Gavrilov
On Mon, May 25, 2009 at 2:28 PM, Asger Ottar Alstrup <asger@area9.dk> wrote:
> On Mon, May 25, 2009 at 7:54 PM, Avery Pennarun <apenwarr@gmail.com> wrote:
>> On Mon, May 25, 2009 at 1:35 PM, Asger Ottar Alstrup <asger@area9.dk> wrote:
>>> So a poor man's system could work like this:
>>>
>>> - A reduced repository is defined by a list of paths in a file, I
>>> guess with a format similar to .gitignore
>>
>> Are you sure you want to define the list with exclusions instead of
>> inclusions? I don't really know your use case.
>
> Since the .gitignore format supports !, I believe that should not make
> much of a difference.
>
>> Anyway, if you're using git filter-branch, it'll be up to you to fix
>> the index to contain the list of files you want. (See man
>> git-filter-branch)
>
> Yes, sure, and that is why I asked whether there is some tool in git
> that can give a list of concrete files surviving a .gitignore list of
> patterns.
Well, the problem here is with the definition of "concrete file." If
you're using git filter-branch --index-filter (which is much faster
than --tree-filter), then your trees won't be checked out at all. And
thus there is the open question of exactly what list of files you want
to use. It's unlikely that any existing tool will do it exactly the
way you want (although I could be wrong).
In any case, what you'd probably do is something like git ls-files
--cached | perlscript, where your perlscript does whatever you want to
the file list.
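A minimal sketch of that pipeline, under a few assumptions: a plain
grep stands in for the "perlscript", the rule "keep only docs/" and the
branch name reduced are invented for the example, and the rewrite is
run on a fresh branch so that the current branch is left alone.
FILTER_BRANCH_SQUELCH_WARNING only silences the notice that recent git
versions print before running filter-branch.

```shell
#!/bin/sh
# Sketch only: repository contents and the keep rule are hypothetical.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export FILTER_BRANCH_SQUELCH_WARNING=1

work=$(mktemp -d)
cd "$work"
git init -q full
cd full
mkdir docs images
echo "manual v1" > docs/manual.txt
echo "pretend this is a large image" > images/big.png
git add .
git commit -q -m "initial import"
echo "manual v2" > docs/manual.txt
git commit -q -am "update manual"

# Rewrite a new branch so the current branch keeps its full history.
git branch reduced

# For every commit: list the index, select the paths to drop, and remove
# them from the index only. Nothing is ever checked out.
# (Paths with unusual characters would need -z handling throughout.)
git filter-branch --index-filter '
    git ls-files --cached | grep -v "^docs/" |
        git update-index --force-remove --stdin
' -- reduced >/dev/null 2>&1

reduced_files=$(git ls-tree -r --name-only reduced)
full_files=$(git ls-tree -r --name-only HEAD)
echo "reduced branch: $reduced_files"
```

Because only the index is edited, the cost per commit does not depend
on the size of the files being dropped.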
> Thanks. OK, I see now that filter-branch will not destroy the original
> repository. That is not at all obvious from reading the man page, when
> the very first sentence says that it will rewrite history. But the
> main point of this exercise is to reduce the size of the reduced
> repository so that it can be transferred effectively. So after
> filter-branch, I guess I would run clone afterwards to make the new,
> smaller repository, and then the question becomes: Will clone reuse
> and prune packs intelligently?
filter-branch will destroy the history of the current branch. But if
you make a new branch first, you'll be okay.
You seem to be giving the concept of "packs" a bit too much weight.
Packs are just groups of objects. AFAIK, cloning and fetching
generally produces entirely new packs for each client.
clone is quite intelligent; in fact, if you clone the repository on
your local machine, it's so intelligent that it'll hardlink the packs
instead of copying them and it'll take virtually no space at all!
But you don't need to copy the whole repository unless you want to.
On a client, you can retrieve just the one stripped-down branch with
something like this:
    mkdir myproj
    cd myproj
    git init
    git fetch server:whatever.git my-stripped-down-branchname
    git checkout -b master FETCH_HEAD
Have fun,
Avery
* Re: git subtree as a solution to partial cloning?
2009-05-25 18:28 ` Asger Ottar Alstrup
2009-05-25 19:18 ` Avery Pennarun
@ 2009-05-25 23:26 ` Jakub Narebski
1 sibling, 0 replies; 9+ messages in thread
From: Jakub Narebski @ 2009-05-25 23:26 UTC (permalink / raw)
To: Asger Ottar Alstrup; +Cc: Avery Pennarun, git, Alexander Gavrilov
Asger Ottar Alstrup <asger@area9.dk> writes:
> On Mon, May 25, 2009 at 7:54 PM, Avery Pennarun <apenwarr@gmail.com> wrote:
>> On Mon, May 25, 2009 at 1:35 PM, Asger Ottar Alstrup <asger@area9.dk> wrote:
>>> So a poor man's system could work like this:
>>>
>>> - A reduced repository is defined by a list of paths in a file, I
>>> guess with a format similar to .gitignore
>>
>> Are you sure you want to define the list with exclusions instead of
>> inclusions? I don't really know your use case.
>
> Since the .gitignore format supports !, I believe that should not make
> much of a difference.
>
>> Anyway, if you're using git filter-branch, it'll be up to you to fix
>> the index to contain the list of files you want. (See man
>> git-filter-branch)
>
> Yes, sure, and that is why I asked whether there is some tool in git
> that can give a list of concrete files surviving a .gitignore list of
> patterns.
I think you would want to use git-ls-files with the
--exclude-from=<file> option, and perhaps also -i/--ignored, to create
a list of files to be removed (using git-update-index) instead of a
list of files to be kept.
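A small sketch of that suggestion, assuming a pattern file in
.gitignore syntax that names what the reduced repository should not
contain. The file name drop-patterns and the repository contents are
hypothetical. Recent git versions require -i to be combined with
--cached (or --others), so --cached is spelled out here.

```shell
#!/bin/sh
# Sketch only: pattern file name and repository contents are made up.
set -eu

work=$(mktemp -d)
cd "$work"
git init -q full
cd full
mkdir docs images
echo "manual" > docs/manual.txt
echo "spec" > docs/spec.pdf
echo "pretend large image" > images/big.png
echo "small logo" > images/logo.png
git add .

# .gitignore-style patterns: drop all PNGs, but "!" rescues the logo.
cat > "$work/drop-patterns" <<'EOF'
*.png
!logo.png
EOF

# Tracked files that match the patterns, i.e. the files to remove.
to_remove=$(git ls-files --cached -i --exclude-from="$work/drop-patterns")
echo "to remove: $to_remove"

# Drop them from the index only; the working tree is not touched.
printf '%s\n' "$to_remove" | git update-index --force-remove --stdin
kept=$(git ls-files --cached)
echo "kept:"
echo "$kept"
```

The same two commands could serve as the body of an --index-filter,
provided the pattern file is referenced by an absolute path.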
>>> - To extract: A copy of the original repository is made. This copy is
>>> reduced using git filter-branch. Is there some way of turning a
>>> .gitignore syntax file into a concrete list of files? Also, can this
>>> entire step be done in one step without the copy? Having to copy the
>>> entire project first seems excessive. Will filter-branch preserve
>>> and/or prune pack files intelligently?
>>
>> You probably need to read about the differences between git trees,
>> blobs, and commits. You're not actually "copying" anything; you're
>> just creating some new directory structures that contain the
>> *existing* blobs. And of course the existing blobs are in your
>> existing packs.
>
> Thanks. OK, I see now that filter-branch will not destroy the original
> repository. That is not at all obvious from reading the man page, when
> the very first sentence says that it will rewrite history.
What git-filter-branch does is write _new_ history and move the old
history to the refs/original/* namespace (that might have changed;
anyway, the old history should be available via the reflog). The
visible effect is that history got rewritten.
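A short sketch of where the old history ends up, assuming a throwaway
repository with invented file and branch names. The index filter is the
"git rm --cached --ignore-unmatch" idiom from the git-filter-branch
manual page.

```shell
#!/bin/sh
# Sketch only: file names and the branch name are hypothetical.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export FILTER_BRANCH_SQUELCH_WARNING=1

work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo
echo "keep me" > keep.txt
echo "drop me" > drop.bin
git add .
git commit -q -m "initial import"
git branch reduced
old_tip=$(git rev-parse reduced)

# Rewrite the branch so that drop.bin never existed on it.
git filter-branch --index-filter \
    'git rm -q --cached --ignore-unmatch drop.bin' -- reduced >/dev/null 2>&1

new_tip=$(git rev-parse reduced)
# The pre-rewrite tip is kept as a backup under refs/original/.
backup_tip=$(git rev-parse refs/original/refs/heads/reduced)
echo "old tip:    $old_tip"
echo "new tip:    $new_tip"
echo "backup tip: $backup_tip"
```

Until that backup ref and the reflog entries are removed, the old
objects stay reachable and the repository does not shrink.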
> But the
> main point of this exercise is to reduce the size of the reduced
> repository so that it can be transferred effectively. So after
> filter-branch, I guess I would run clone afterwards to make the new,
> smaller repository, and then the question becomes: Will clone reuse
> and prune packs intelligently?
Yes, it would... well, you have to take into account that an ordinary
clone over the local filesystem hardlinks packfiles, so you need to
use the file:// trick to force a repack; you might also want to use
--reference to set up alternates.
But that is not necessary: if you want to push only a _subset_ of
branches efficiently, you can define the remote info appropriately,
and push will intelligently transfer only the needed objects.
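The file:// clone mentioned above can be sketched as follows. As a
simplification, a branch that never contained the large file stands in
for a filtered branch; the repository names, file names and the branch
name reduced are assumptions.

```shell
#!/bin/sh
# Sketch only: "reduced" stands in for an already stripped-down branch.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"
git init -q full
cd full
echo "manual" > manual.txt
git add .
git commit -q -m "docs"
git branch reduced
echo "pretend this is a large image" > big.bin
git add .
git commit -q -m "add large file"
big_blob=$(git rev-parse HEAD:big.bin)
cd "$work"

# A file:// URL goes through the normal fetch machinery, so a new pack
# with only the needed objects is built; a plain path would hardlink the
# existing packs instead.
git clone -q --single-branch --branch reduced "file://$work/full" small

if git -C small cat-file -e "$big_blob" 2>/dev/null; then
    has_big=yes
else
    has_big=no
fi
small_files=$(git -C small ls-tree -r --name-only HEAD)
echo "large blob present in clone: $has_big"
```

Only objects reachable from the requested branch are transferred, so
the large blob never reaches the smaller clone.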
[...]
> However, there is a large group of users that do not need this, but
> they DO need the entire history of the files they are interested in.
> Subversion does not provide this. Also, Subversion is simply too slow
> to handle the kind of files we need to work with. Also, we have run
> tests on the kind of files we have, and the delta compression that git
> uses is very effective at compressing the PDF and OpenOffice
> documents we use. The big files we have are primarily image files, and
> obviously they do not compress very well. Fortunately, they do not
> change much either.
You might want to turn off delta compression for binary files via the
`delta` gitattribute; it might help (it might not).
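A minimal sketch of such a setting, assuming the images are PNG and
JPEG files (the patterns and paths are only examples). git check-attr
reports how the attribute resolves for a given path, whether or not the
file exists.

```shell
#!/bin/sh
# Sketch only: the file patterns and paths are assumptions.
set -eu

work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo

# "-delta" unsets the attribute, so no delta compression is attempted
# for blobs at matching paths.
printf '%s\n' '*.png -delta' '*.jpg -delta' > .gitattributes

png_attr=$(git check-attr delta -- images/photo.png)
pdf_attr=$(git check-attr delta -- docs/manual.pdf)
echo "$png_attr"
echo "$pdf_attr"
```

Paths that match no pattern, such as the PDF here, keep the default
behaviour and are still considered for delta compression.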
--
Jakub Narebski
Poland
ShadeHawk on #git