* Fwd: Local clones aka forks disk size optimization
[not found] <CALZVapmG+HL0SQx8zx=Cfz5pWv84hJq90x-7VdjA0m2Z4dC34A@mail.gmail.com>
@ 2012-11-14 23:42 ` Javier Domingo
2012-11-15 0:18 ` Andrew Ardill
[not found] ` <CAKs0BQ7RyLZr+ZU=hAC4U7xXpE0+xvORTrvfzFYb6muA2TgM4Q@mail.gmail.com>
0 siblings, 2 replies; 13+ messages in thread
From: Javier Domingo @ 2012-11-14 23:42 UTC (permalink / raw)
To: git
Hi,
I have come up with this while doing some local forks for work.
Currently, when you clone a repo using a path (not file:/// protocol)
you get all the common objects linked.
But as you work, each one will keep growing on its own, although
they may have common objects.
Is there any way to avoid this? I mean, can something be done in git
so that, when pulling, it checks for the same objects in the other forks?
Though this doesn't make much sense on clients, when you have to
maintain 20 forks of very big projects on the server side, it eats
precious disk space.
I don't know if this should have [RFC] in the subject or what. But
here is my idea.
As hardlinking is already done by git, if it checked how many
links there are for its files, it would be able to find other dirs
to search. The easiest way would be to check the oldest pack.
Hope you like this idea,
Javier Domingo
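A minimal sketch of the link-count check described above, assuming a POSIX shell with git, find, ls and awk available. The temporary directory, the repository names "origin" and "fork", and the example identity are hypothetical. It inspects one loose object rather than the oldest pack, and it only shows that a path-based clone shares files through hard links; it does not locate the other repositories.

```shell
set -eu
# Hypothetical identity so the commit works without any global git config.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.invalid
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.invalid

work=$(mktemp -d)
git init -q "$work/origin"
echo hello > "$work/origin/file.txt"
git -C "$work/origin" add file.txt
git -C "$work/origin" commit -q -m "initial"

# Clone by path (not file://), so object files are hardlinked when possible.
git clone -q "$work/origin" "$work/fork"

# Pick one loose object in the fork and read its hard-link count
# (second column of ls -l).
obj=$(find "$work/fork/.git/objects" -type f -path '*/[0-9a-f][0-9a-f]/*' | head -n 1)
links=$(ls -l "$obj" | awk '{print $2}')
echo "$obj has $links link(s)"
```

A count above 1 means the file is shared with at least one other directory entry, although the count alone does not say where that entry lives.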
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Local clones aka forks disk size optimization
2012-11-14 23:42 ` Fwd: Local clones aka forks disk size optimization Javier Domingo
@ 2012-11-15 0:18 ` Andrew Ardill
2012-11-15 0:40 ` Javier Domingo
[not found] ` <CAKs0BQ7RyLZr+ZU=hAC4U7xXpE0+xvORTrvfzFYb6muA2TgM4Q@mail.gmail.com>
1 sibling, 1 reply; 13+ messages in thread
From: Andrew Ardill @ 2012-11-15 0:18 UTC (permalink / raw)
To: Javier Domingo; +Cc: git@vger.kernel.org
On 15 November 2012 10:42, Javier Domingo <javierdo1@gmail.com> wrote:
> Hi,
>
> I have come up with this while doing some local forks for work.
> Currently, when you clone a repo using a path (not file:/// protocol)
> you get all the common objects linked.
>
> But as you work, each one will continue growing on its way, although
> they may have common objects.
>
> Is there any way to avoid this? I mean, can something be done in git,
> that it checks for (when pulling) the same objects in the other forks?
Have you seen alternates? From [1]:
> How to share objects between existing repositories?
> ---------------------------------------------------------------------------
>
> Do
>
> echo "/source/git/project/.git/objects/" > .git/objects/info/alternates
>
> and then follow it up with
>
> git repack -a -d -l
>
> where the '-l' means that it will only put local objects in the pack-file
> (strictly speaking, it will put any loose objects from the alternate tree
> too, so you'll have a fully packed archive, but it won't duplicate objects
> that are already packed in the alternate tree).
[1] https://git.wiki.kernel.org/index.php/GitFaq#How_to_share_objects_between_existing_repositories.3F
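The FAQ steps above can be tried on throwaway repositories. This is only a sketch, assuming a POSIX shell and git; the directory names "main" and "fork" and the example identity are hypothetical, and --no-hardlinks is used only so that the fork starts with its own copies.

```shell
set -eu
# Hypothetical identity so the commit works without any global git config.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.invalid
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.invalid

work=$(mktemp -d)
git init -q "$work/main"
echo hello > "$work/main/file.txt"
git -C "$work/main" add file.txt
git -C "$work/main" commit -q -m "initial"

# Give the fork its own full copy of the objects.
git clone -q --no-hardlinks "$work/main" "$work/fork"

# Pack the main repository, then point the fork at its object store.
git -C "$work/main" repack -a -d -q
echo "$work/main/.git/objects" > "$work/fork/.git/objects/info/alternates"

# Repack the fork, leaving out objects already packed in the alternate.
git -C "$work/fork" repack -a -d -l -q

# Show what is still stored locally and verify the fork is intact.
git -C "$work/fork" count-objects -v
git -C "$work/fork" fsck
```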
Regards,
Andrew Ardill
* Re: Local clones aka forks disk size optimization
2012-11-15 0:18 ` Andrew Ardill
@ 2012-11-15 0:40 ` Javier Domingo
2012-11-15 0:53 ` Andrew Ardill
0 siblings, 1 reply; 13+ messages in thread
From: Javier Domingo @ 2012-11-15 0:40 UTC (permalink / raw)
To: Andrew Ardill; +Cc: git@vger.kernel.org
Hi Andrew,
The problem with that is that if I want to delete the first repo, I
will lose objects... Or does that repack also hard-link the objects
in other repos? I don't want to accidentally lose data, so it would
be nice if, besides avoiding repacking those objects, it also
hardlinked them.
Javier Domingo
* Re: Local clones aka forks disk size optimization
2012-11-15 0:40 ` Javier Domingo
@ 2012-11-15 0:53 ` Andrew Ardill
2012-11-15 1:15 ` Javier Domingo
0 siblings, 1 reply; 13+ messages in thread
From: Andrew Ardill @ 2012-11-15 0:53 UTC (permalink / raw)
To: Javier Domingo; +Cc: git@vger.kernel.org
On 15 November 2012 11:40, Javier Domingo <javierdo1@gmail.com> wrote:
> Hi Andrew,
>
> The problem with that is that if I want to delete the first repo, I
> will lose objects... Or does that repack also hard-link the objects
> in other repos? I don't want to accidentally lose data, so it would
> be nice if, besides avoiding repacking those objects, it also
> hardlinked them.
Hi Javier, check out the section below the one I linked earlier:
> How to stop sharing objects between repositories?
>
> To copy the shared objects into the local repository, repack without the -l flag
>
> git repack -a
>
> Then remove the pointer to the alternate object store
>
> rm .git/objects/info/alternates
>
> (If the repository is edited between the two steps, it could become corrupted
> when the alternates file is removed. If you're unsure, you can use git fsck to
> check for corruption. If things go wrong, you can always recover by replacing
> the alternates file and starting over).
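A small sketch of the "stop sharing" steps quoted above, assuming a POSIX shell and git. The repository names and the example identity are hypothetical, and "git clone --shared" is used here only as a quick way to get a fork that borrows objects through alternates. Deleting the main repository at the end mirrors the scenario Javier is worried about.

```shell
set -eu
# Hypothetical identity so the commit works without any global git config.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.invalid
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.invalid

work=$(mktemp -d)
git init -q "$work/main"
echo hello > "$work/main/file.txt"
git -C "$work/main" add file.txt
git -C "$work/main" commit -q -m "initial"

# The fork borrows every object from main via objects/info/alternates.
git clone -q --shared "$work/main" "$work/fork"

# Copy the borrowed objects into a local pack (no -l), then drop the pointer.
git -C "$work/fork" repack -a -q
rm "$work/fork/.git/objects/info/alternates"

# The first repository can now go away without the fork losing objects.
rm -rf "$work/main"
git -C "$work/fork" fsck && fsck_status=ok
echo "fsck after removing main: $fsck_status"
```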
Regards,
Andrew Ardill
* Re: Local clones aka forks disk size optimization
2012-11-15 0:53 ` Andrew Ardill
@ 2012-11-15 1:15 ` Javier Domingo
2012-11-15 1:34 ` Andrew Ardill
2012-11-16 14:55 ` Pyeron, Jason J CTR (US)
0 siblings, 2 replies; 13+ messages in thread
From: Javier Domingo @ 2012-11-15 1:15 UTC (permalink / raw)
To: Andrew Ardill; +Cc: git@vger.kernel.org
Hi Andrew,
Doing this would require me to keep track of which repo comes from which. So
it would imply some logic (and a db) on top of it. With the hardlinking way,
it wouldn't require anything. The idea is that you don't have to do
anything else on the server.
I understand that it would be impossible to do it for Windows users
(except using Cygwin), but for *nix ones, yes...
Javier Domingo
* Re: Local clones aka forks disk size optimization
2012-11-15 1:15 ` Javier Domingo
@ 2012-11-15 1:34 ` Andrew Ardill
2012-11-15 3:44 ` Sitaram Chamarty
2012-11-16 14:55 ` Pyeron, Jason J CTR (US)
1 sibling, 1 reply; 13+ messages in thread
From: Andrew Ardill @ 2012-11-15 1:34 UTC (permalink / raw)
To: Javier Domingo; +Cc: git@vger.kernel.org
On 15 November 2012 12:15, Javier Domingo <javierdo1@gmail.com> wrote:
> Hi Andrew,
>
> Doing this would require me to keep track of which repo comes from which. So
> it would imply some logic (and a db) on top of it. With the hardlinking way,
> it wouldn't require anything. The idea is that you don't have to do
> anything else on the server.
>
> I understand that it would be impossible to do it for Windows users
> (except using Cygwin), but for *nix ones, yes...
> Javier Domingo
Paraphrasing from git-clone(1):
When cloning a repository, if the source repository is specified with
/path/to/repo syntax, the default is to clone the repository by making
a copy of HEAD and everything under objects and refs directories. The
files under .git/objects/ directory are hardlinked to save space when
possible. To force copying instead of hardlinking (which may be
desirable if you are trying to make a back-up of your repository)
--no-hardlinks can be used.
So hardlinks should be used where possible; if they are not, try
upgrading Git.
I think that covers all the use cases you have?
Regards,
Andrew Ardill
* Re: Local clones aka forks disk size optimization
2012-11-15 1:34 ` Andrew Ardill
@ 2012-11-15 3:44 ` Sitaram Chamarty
2012-11-16 11:25 ` Michael J Gruber
0 siblings, 1 reply; 13+ messages in thread
From: Sitaram Chamarty @ 2012-11-15 3:44 UTC (permalink / raw)
To: Andrew Ardill; +Cc: Javier Domingo, git@vger.kernel.org
On Thu, Nov 15, 2012 at 7:04 AM, Andrew Ardill <andrew.ardill@gmail.com> wrote:
> On 15 November 2012 12:15, Javier Domingo <javierdo1@gmail.com> wrote:
>> Hi Andrew,
>>
>> Doing this would require me to keep track of which repo comes from which. So
>> it would imply some logic (and a db) on top of it. With the hardlinking way,
>> it wouldn't require anything. The idea is that you don't have to do
>> anything else on the server.
>>
>> I understand that it would be impossible to do it for Windows users
>> (except using Cygwin), but for *nix ones, yes...
>> Javier Domingo
>
> Paraphrasing from git-clone(1):
>
> When cloning a repository, if the source repository is specified with
> /path/to/repo syntax, the default is to clone the repository by making
> a copy of HEAD and everything under objects and refs directories. The
> files under .git/objects/ directory are hardlinked to save space when
> possible. To force copying instead of hardlinking (which may be
> desirable if you are trying to make a back-up of your repository)
> --no-hardlinks can be used.
>
> So hardlinks should be used where possible, and if they are not try
> upgrading Git.
>
> I think that covers all the use cases you have?
I am not sure it does. My understanding is this:
'git clone -l' saves space on the initial clone, but subsequent pushes
end up with the same objects duplicated across all the "forks"
(assuming most of the forks keep up with some canonical repo).
The alternates mechanism can give you ongoing savings (as long as you
push to the "main" repo first), but it is dangerous, in the words of
the git-clone manpage. You have to be confident no one will delete a
ref from the "main" repo and then do a gc or let it auto-gc.
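The danger with alternates mentioned above can be reproduced on throwaway repositories. This is a sketch under stated assumptions: a POSIX shell and git, hypothetical repository names "main" and "fork", a hypothetical branch "topic", and an aggressive "reflog expire" plus "gc --prune=now" to stand in for what would normally happen more slowly through auto-gc.

```shell
set -eu
# Hypothetical identity so the commits work without any global git config.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.invalid
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.invalid

work=$(mktemp -d)
main="$work/main"
fork="$work/fork"

git init -q "$main"
echo base > "$main/file.txt"
git -C "$main" add file.txt
git -C "$main" commit -q -m "base"

# A branch in main that the fork will end up depending on.
git -C "$main" checkout -q -b topic
echo topic > "$main/topic.txt"
git -C "$main" add topic.txt
git -C "$main" commit -q -m "topic work"
git -C "$main" checkout -q -

# The fork stores nothing itself; it borrows all objects from main.
git clone -q --shared "$main" "$fork"

# Someone deletes the ref in main and garbage-collects.
git -C "$main" branch -q -D topic
git -C "$main" reflog expire --expire=now --all
git -C "$main" gc -q --prune=now

# The fork still has refs/remotes/origin/topic, but is the commit still there?
if git -C "$fork" cat-file -e refs/remotes/origin/topic 2>/dev/null; then
    topic_state=present
else
    topic_state=missing
fi
echo "origin/topic commit in fork: $topic_state"
```

The gc in main has no knowledge of the fork's refs, which is why objects the fork still needs can be pruned.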
He's looking for something that addresses both these issues.
As an additional idea, I suspect this is what the namespaces feature
was created for, but I am not sure, and have never played with it till
now.
Maybe someone who knows namespaces very well will chip in...
* Re: Local clones aka forks disk size optimization
2012-11-15 3:44 ` Sitaram Chamarty
@ 2012-11-16 11:25 ` Michael J Gruber
2012-11-16 18:04 ` Enrico Weigelt
0 siblings, 1 reply; 13+ messages in thread
From: Michael J Gruber @ 2012-11-16 11:25 UTC (permalink / raw)
To: Sitaram Chamarty; +Cc: Andrew Ardill, Javier Domingo, git@vger.kernel.org
Sitaram Chamarty venit, vidit, dixit 15.11.2012 04:44:
> On Thu, Nov 15, 2012 at 7:04 AM, Andrew Ardill <andrew.ardill@gmail.com> wrote:
>> On 15 November 2012 12:15, Javier Domingo <javierdo1@gmail.com> wrote:
>>> Hi Andrew,
>>>
>>> Doing this would require me to keep track of which repo comes from which. So
>>> it would imply some logic (and a db) on top of it. With the hardlinking way,
>>> it wouldn't require anything. The idea is that you don't have to do
>>> anything else on the server.
>>>
>>> I understand that it would be impossible to do it for Windows users
>>> (except using Cygwin), but for *nix ones, yes...
>>> Javier Domingo
>>
>> Paraphrasing from git-clone(1):
>>
>> When cloning a repository, if the source repository is specified with
>> /path/to/repo syntax, the default is to clone the repository by making
>> a copy of HEAD and everything under objects and refs directories. The
>> files under .git/objects/ directory are hardlinked to save space when
>> possible. To force copying instead of hardlinking (which may be
>> desirable if you are trying to make a back-up of your repository)
>> --no-hardlinks can be used.
>>
>> So hardlinks should be used where possible, and if they are not try
>> upgrading Git.
>>
>> I think that covers all the use cases you have?
>
> I am not sure it does. My understanding is this:
>
> 'git clone -l' saves space on the initial clone, but subsequent pushes
> end up with the same objects duplicated across all the "forks"
> (assuming most of the forks keep up with some canonical repo).
>
> The alternates mechanism can give you ongoing savings (as long as you
> push to the "main" repo first), but it is dangerous, in the words of
> the git-clone manpage. You have to be confident no one will delete a
> ref from the "main" repo and then do a gc or let it auto-gc.
>
> He's looking for something that addresses both these issues.
>
> As an additional idea, I suspect this is what the namespaces feature
> was created for, but I am not sure, and have never played with it till
> now.
>
> Maybe someone who knows namespaces very well will chip in...
>
I dunno about namespaces, but a safe route with alternates seems to be:
Provide one "main" clone which is bare, pulls automatically, and is
there to stay (no pruning), so that all others can use that as a
reliable alternates source.
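One possible reading of this setup, as a sketch only: it assumes a POSIX shell and git, and the names "upstream", "main.git" and "fork" are hypothetical. Setting gc.auto to 0 and gc.pruneExpire to never is just one way to approximate "no pruning" (it does not stop someone from running "git prune" by hand), and the fetch line stands in for whatever job would pull automatically.

```shell
set -eu
# Hypothetical identity so the commit works without any global git config.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.invalid
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.invalid

work=$(mktemp -d)
git init -q "$work/upstream"
echo hello > "$work/upstream/file.txt"
git -C "$work/upstream" add file.txt
git -C "$work/upstream" commit -q -m "initial"

# The long-lived bare "main" clone that everyone borrows from.
git clone -q --bare "$work/upstream" "$work/main.git"
git -C "$work/main.git" config gc.auto 0
git -C "$work/main.git" config gc.pruneExpire never

# Stand-in for the automatic pull, e.g. run periodically.
git -C "$work/main.git" fetch -q origin '+refs/heads/*:refs/heads/*'

# A fork that uses main.git as its alternates source. The file:// URL
# makes git transfer objects instead of hardlinking them from upstream.
git clone -q --reference "$work/main.git" "file://$work/upstream" "$work/fork"

cat "$work/fork/.git/objects/info/alternates"
git -C "$work/fork" count-objects -v
```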
Michael
* RE: Local clones aka forks disk size optimization
2012-11-15 1:15 ` Javier Domingo
2012-11-15 1:34 ` Andrew Ardill
@ 2012-11-16 14:55 ` Pyeron, Jason J CTR (US)
1 sibling, 0 replies; 13+ messages in thread
From: Pyeron, Jason J CTR (US) @ 2012-11-16 14:55 UTC (permalink / raw)
To: git@vger.kernel.org
> -----Original Message-----
> From: Javier Domingo
> Sent: Wednesday, November 14, 2012 8:15 PM
>
> Hi Andrew,
>
> Doing this would require me to keep track of which repo comes from which. So
> it would imply some logic (and a db) on top of it. With the hardlinking way,
> it wouldn't require anything. The idea is that you don't have to do
> anything else on the server.
>
> I understand that it would be impossible to do it for Windows users
Not true; it is a file system issue, not an OS issue. FAT does not support hard links, but ext2, ext3, ext4 and NTFS do.
> (except using Cygwin), but for *nix ones, yes...
> Javier Domingo
* Re: Local clones aka forks disk size optimization
2012-11-16 11:25 ` Michael J Gruber
@ 2012-11-16 18:04 ` Enrico Weigelt
2012-11-18 10:42 ` Sitaram Chamarty
0 siblings, 1 reply; 13+ messages in thread
From: Enrico Weigelt @ 2012-11-16 18:04 UTC (permalink / raw)
To: Michael J Gruber; +Cc: Andrew Ardill, Javier Domingo, git, Sitaram Chamarty
> Provide one "main" clone which is bare, pulls automatically, and is
> there to stay (no pruning), so that all others can use that as a
> reliable alternates source.
The problem here, IMHO, is the assumption that the main repo will
never be cleaned up. But what do you do if you don't want to let it grow
forever?
Hmm, distributed GC is a tricky problem.
Maybe it could be easier to have two kinds of alternates:
a) classical: gc+friends will drop local objects that are
already there
b) fallback: normal operations fetch objects from there if they are not
accessible from anywhere else, but gc+friends do not skip objects from there.
And extend the prune machinery to put a backup of the dropped objects
into some separate store.
This way we could use some kind of rotating archive:
* GC'ed objects will be stored in the backup repo for some while
* there are multiple active (rotating) backups kept for some time,
each cycle, only the oldest one is dropped (and maybe objects
in a newer backup are removed from the older ones)
* downstream repos must be synced often enough, so removed objects
are fetched back from the backups early enough
You could see this as a kind of heap:
* the currently active objects (directly referenced) are always
on the top
* once they're not referenced, they sink a level deeper
* when they're referenced again, they immediately jump up to the top
* at some point in time, unreferenced objects sink so deep that
they're dropped completely
cu
--
Mit freundlichen Grüßen / Kind regards
Enrico Weigelt
VNC - Virtual Network Consult GmbH
Head Of Development
Pariser Platz 4a, D-10117 Berlin
Tel.: +49 (30) 3464615-20
Fax: +49 (30) 3464615-59
enrico.weigelt@vnc.biz; www.vnc.de
* Re: Local clones aka forks disk size optimization
2012-11-16 18:04 ` Enrico Weigelt
@ 2012-11-18 10:42 ` Sitaram Chamarty
2012-11-18 17:02 ` Enrico Weigelt
0 siblings, 1 reply; 13+ messages in thread
From: Sitaram Chamarty @ 2012-11-18 10:42 UTC (permalink / raw)
To: Enrico Weigelt; +Cc: Michael J Gruber, Andrew Ardill, Javier Domingo, git
On Fri, Nov 16, 2012 at 11:34 PM, Enrico Weigelt <enrico.weigelt@vnc.biz> wrote:
>
>> Provide one "main" clone which is bare, pulls automatically, and is
>> there to stay (no pruning), so that all others can use that as a
>> reliable alternates source.
>
> The problem here, IMHO, is the assumption, that the main repo will
> never be cleaned up. But what to do if you dont wanna let it grow
> forever ?
That's not the only problem. I believe you only get the savings when
the main repo gets the commits first. Which is probably ok most of
the time but it's worth mentioning.
>
> hmm, distributed GC is a tricky problem.
Except for one little issue (see other thread, subject line "cloning a
namespace downloads all the objects"), namespaces appear to do
everything we want in terms of the typical use cases for alternates,
and/or 'git clone -l', at least on the server side.
* Re: Local clones aka forks disk size optimization
2012-11-18 10:42 ` Sitaram Chamarty
@ 2012-11-18 17:02 ` Enrico Weigelt
0 siblings, 0 replies; 13+ messages in thread
From: Enrico Weigelt @ 2012-11-18 17:02 UTC (permalink / raw)
To: Sitaram Chamarty; +Cc: Michael J Gruber, Andrew Ardill, Javier Domingo, git
Hi,
> That's not the only problem. I believe you only get the savings when
> the main repo gets the commits first. Which is probably ok most of
> the time but it's worth mentioning.
Well, the saving will just be deferred to the point where the commit
finally reaches the main repo and the downstreams are gc'ed.
> > hmm, distributed GC is a tricky problem.
>
> Except for one little issue (see other thread, subject line "cloning
> a
> namespace downloads all the objects"), namespaces appear to do
> everything we want in terms of the typical use cases for alternates,
> and/or 'git clone -l', at least on the server side.
Hmm, not sure about the actual internals, but that namespace filtering
should work in such a way that a local clone never sees (or considers)
remote refs that are outside of the requested namespace. Perhaps that
should be handled entirely on the server side, so all called commands treat
these refs as nonexistent.
By the way: what happens if one tries to clone from a broken repo
(one which has several refs pointing to nonexistent objects)?
cu
* Re: Local clones aka forks disk size optimization
[not found] ` <CAKs0BQ7RyLZr+ZU=hAC4U7xXpE0+xvORTrvfzFYb6muA2TgM4Q@mail.gmail.com>
@ 2012-11-18 17:18 ` Jörg Rosenkranz
0 siblings, 0 replies; 13+ messages in thread
From: Jörg Rosenkranz @ 2012-11-18 17:18 UTC (permalink / raw)
To: Javier Domingo; +Cc: git
2012/11/15 Javier Domingo <javierdo1@gmail.com>
>
> Is there any way to avoid this? I mean, can something be done in git,
> that it checks for (when pulling) the same objects in the other forks?
I've been using git-new-workdir
(https://github.com/git/git/blob/master/contrib/workdir/git-new-workdir)
for a similar problem. Maybe that's what you're looking for?
Joerg.