* Fwd: Local clones aka forks disk size optimization

From: Javier Domingo @ 2012-11-14 23:42 UTC
To: git

Hi,

I have come up with this while doing some local forks for work.
Currently, when you clone a repo using a path (not the file:/// protocol)
you get all the common objects linked.

But as you work, each one will continue growing on its own, although
they may have common objects.

Is there any way to avoid this? I mean, can something be done in git so
that it checks (when pulling) for the same objects in the other forks?

Though this doesn't make much sense for clients, when you have to
maintain 20 forks of very big projects on the server side, it eats
precious disk space.

I don't know if this should have [RFC] in the subject or not, but here
is my idea. As hardlinking is already done by git, if it checked how
many links there are for its files, it would be able to find other
directories to search. The easiest way is checking for the most ancient
pack.

Hope you like this idea,
Javier Domingo
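A minimal sketch of the local-path clone behaviour described above
(assuming a source repository at /srv/git/project.git; the paths are
placeholders):

    # Cloning via a plain path hardlinks everything under objects/
    git clone /srv/git/project.git fork1
    git clone /srv/git/project.git fork2

    # Hardlinked objects show a link count greater than 1
    find fork1/.git/objects -type f -links +1 | wc -l

    # Objects created after the clone (new commits, fetched packs) are
    # private to each fork, so the object stores slowly diverge in size
    du -sh fork1/.git/objects fork2/.git/objects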
* Re: Local clones aka forks disk size optimization

From: Andrew Ardill @ 2012-11-15 0:18 UTC
To: Javier Domingo; +Cc: git@vger.kernel.org

On 15 November 2012 10:42, Javier Domingo <javierdo1@gmail.com> wrote:
> Hi,
>
> I have come up with this while doing some local forks for work.
> Currently, when you clone a repo using a path (not the file:/// protocol)
> you get all the common objects linked.
>
> But as you work, each one will continue growing on its own, although
> they may have common objects.
>
> Is there any way to avoid this? I mean, can something be done in git so
> that it checks (when pulling) for the same objects in the other forks?

Have you seen alternates? From [1]:

> How to share objects between existing repositories?
> ---------------------------------------------------
>
> Do
>
>   echo "/source/git/project/.git/objects/" > .git/objects/info/alternates
>
> and then follow it up with
>
>   git repack -a -d -l
>
> where the '-l' means that it will only put local objects in the pack-file
> (strictly speaking, it will put any loose objects from the alternate tree
> too, so you'll have a fully packed archive, but it won't duplicate objects
> that are already packed in the alternate tree).

[1] https://git.wiki.kernel.org/index.php/GitFaq#How_to_share_objects_between_existing_repositories.3F

Regards,

Andrew Ardill
* Re: Local clones aka forks disk size optimization

From: Javier Domingo @ 2012-11-15 0:40 UTC
To: Andrew Ardill; +Cc: git@vger.kernel.org

Hi Andrew,

The problem with that is that if I want to delete the first repo, I
will lose objects... Or does that repack also hard-link the objects in
the other repos? I don't want to accidentally lose data, so it would be
nice if, although the repack skips those shared objects, it also
hardlinked them.

Javier Domingo

2012/11/15 Andrew Ardill <andrew.ardill@gmail.com>:
> On 15 November 2012 10:42, Javier Domingo <javierdo1@gmail.com> wrote:
>> Hi,
>>
>> I have come up with this while doing some local forks for work.
>> Currently, when you clone a repo using a path (not the file:/// protocol)
>> you get all the common objects linked.
>>
>> But as you work, each one will continue growing on its own, although
>> they may have common objects.
>>
>> Is there any way to avoid this? I mean, can something be done in git so
>> that it checks (when pulling) for the same objects in the other forks?
>
> Have you seen alternates? From [1]:
>
>> How to share objects between existing repositories?
>> ---------------------------------------------------
>>
>> Do
>>
>>   echo "/source/git/project/.git/objects/" > .git/objects/info/alternates
>>
>> and then follow it up with
>>
>>   git repack -a -d -l
>>
>> where the '-l' means that it will only put local objects in the pack-file
>> (strictly speaking, it will put any loose objects from the alternate tree
>> too, so you'll have a fully packed archive, but it won't duplicate objects
>> that are already packed in the alternate tree).
>
> [1] https://git.wiki.kernel.org/index.php/GitFaq#How_to_share_objects_between_existing_repositories.3F
>
> Regards,
>
> Andrew Ardill
* Re: Local clones aka forks disk size optimization

From: Andrew Ardill @ 2012-11-15 0:53 UTC
To: Javier Domingo; +Cc: git@vger.kernel.org

On 15 November 2012 11:40, Javier Domingo <javierdo1@gmail.com> wrote:
> Hi Andrew,
>
> The problem with that is that if I want to delete the first repo, I
> will lose objects... Or does that repack also hard-link the objects in
> the other repos? I don't want to accidentally lose data, so it would be
> nice if, although the repack skips those shared objects, it also
> hardlinked them.

Hi Javier, check out the section below the one I linked earlier:

> How to stop sharing objects between repositories?
>
> To copy the shared objects into the local repository, repack without the -l flag
>
>   git repack -a
>
> Then remove the pointer to the alternate object store
>
>   rm .git/objects/info/alternates
>
> (If the repository is edited between the two steps, it could become corrupted
> when the alternates file is removed. If you're unsure, you can use git fsck to
> check for corruption. If things go wrong, you can always recover by replacing
> the alternates file and starting over).

Regards,

Andrew Ardill
* Re: Local clones aka forks disk size optimization

From: Javier Domingo @ 2012-11-15 1:15 UTC
To: Andrew Ardill; +Cc: git@vger.kernel.org

Hi Andrew,

Doing this would require tracking which fork comes from which repo, so
it would imply some logic (and a database) on top of it. With the
hardlinking approach, nothing extra would be required; the idea is that
you don't have to do anything else on the server.

I understand that it would be impossible to do for Windows users
(except under Cygwin), but for *nix ones, yes...

Javier Domingo

2012/11/15 Andrew Ardill <andrew.ardill@gmail.com>:
> On 15 November 2012 11:40, Javier Domingo <javierdo1@gmail.com> wrote:
>> Hi Andrew,
>>
>> The problem with that is that if I want to delete the first repo, I
>> will lose objects... Or does that repack also hard-link the objects in
>> the other repos? I don't want to accidentally lose data, so it would be
>> nice if, although the repack skips those shared objects, it also
>> hardlinked them.
>
> Hi Javier, check out the section below the one I linked earlier:
>
>> How to stop sharing objects between repositories?
>>
>> To copy the shared objects into the local repository, repack without the -l flag
>>
>>   git repack -a
>>
>> Then remove the pointer to the alternate object store
>>
>>   rm .git/objects/info/alternates
>>
>> (If the repository is edited between the two steps, it could become corrupted
>> when the alternates file is removed. If you're unsure, you can use git fsck to
>> check for corruption. If things go wrong, you can always recover by replacing
>> the alternates file and starting over).
>
> Regards,
>
> Andrew Ardill
* Re: Local clones aka forks disk size optimization

From: Andrew Ardill @ 2012-11-15 1:34 UTC
To: Javier Domingo; +Cc: git@vger.kernel.org

On 15 November 2012 12:15, Javier Domingo <javierdo1@gmail.com> wrote:
> Hi Andrew,
>
> Doing this would require tracking which fork comes from which repo, so
> it would imply some logic (and a database) on top of it. With the
> hardlinking approach, nothing extra would be required; the idea is that
> you don't have to do anything else on the server.
>
> I understand that it would be impossible to do for Windows users
> (except under Cygwin), but for *nix ones, yes...
> Javier Domingo

Paraphrasing from git-clone(1):

When cloning a repository, if the source repository is specified with
/path/to/repo syntax, the default is to clone the repository by making
a copy of HEAD and everything under the objects and refs directories.
The files under the .git/objects/ directory are hardlinked to save
space when possible. To force copying instead of hardlinking (which may
be desirable if you are trying to make a back-up of your repository),
--no-hardlinks can be used.

So hardlinks should be used where possible, and if they are not, try
upgrading Git.

I think that covers all the use cases you have?

Regards,

Andrew Ardill
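A hedged illustration of the two behaviours the man page describes
(the repository path is a placeholder):

    # Default local clone: objects are hardlinked where the filesystem
    # allows; files with a link count of 1 were copied rather than linked
    git clone /srv/git/project.git fork
    find fork/.git/objects -type f -links 1 | wc -l

    # Force real copies instead of hardlinks, e.g. for a standalone backup
    git clone --no-hardlinks /srv/git/project.git backup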
* Re: Local clones aka forks disk size optimization

From: Sitaram Chamarty @ 2012-11-15 3:44 UTC
To: Andrew Ardill; +Cc: Javier Domingo, git@vger.kernel.org

On Thu, Nov 15, 2012 at 7:04 AM, Andrew Ardill <andrew.ardill@gmail.com> wrote:
> On 15 November 2012 12:15, Javier Domingo <javierdo1@gmail.com> wrote:
>> Hi Andrew,
>>
>> Doing this would require tracking which fork comes from which repo, so
>> it would imply some logic (and a database) on top of it. With the
>> hardlinking approach, nothing extra would be required; the idea is that
>> you don't have to do anything else on the server.
>>
>> I understand that it would be impossible to do for Windows users
>> (except under Cygwin), but for *nix ones, yes...
>> Javier Domingo
>
> Paraphrasing from git-clone(1):
>
> When cloning a repository, if the source repository is specified with
> /path/to/repo syntax, the default is to clone the repository by making
> a copy of HEAD and everything under the objects and refs directories.
> The files under the .git/objects/ directory are hardlinked to save
> space when possible. To force copying instead of hardlinking (which may
> be desirable if you are trying to make a back-up of your repository),
> --no-hardlinks can be used.
>
> So hardlinks should be used where possible, and if they are not, try
> upgrading Git.
>
> I think that covers all the use cases you have?

I am not sure it does. My understanding is this:

'git clone -l' saves space on the initial clone, but subsequent pushes
end up with the same objects duplicated across all the "forks"
(assuming most of the forks keep up with some canonical repo).

The alternates mechanism can give you ongoing savings (as long as you
push to the "main" repo first), but it is dangerous, in the words of
the git-clone manpage. You have to be confident no one will delete a
ref from the "main" repo and then do a gc or let it auto-gc.

He's looking for something that addresses both these issues.

As an additional idea, I suspect this is what the namespaces feature
was created for, but I am not sure, and have never played with it till
now.

Maybe someone who knows namespaces very well will chip in...
* Re: Local clones aka forks disk size optimization

From: Michael J Gruber @ 2012-11-16 11:25 UTC
To: Sitaram Chamarty; +Cc: Andrew Ardill, Javier Domingo, git@vger.kernel.org

Sitaram Chamarty venit, vidit, dixit 15.11.2012 04:44:
> On Thu, Nov 15, 2012 at 7:04 AM, Andrew Ardill <andrew.ardill@gmail.com> wrote:
>> On 15 November 2012 12:15, Javier Domingo <javierdo1@gmail.com> wrote:
>>> Hi Andrew,
>>>
>>> Doing this would require tracking which fork comes from which repo, so
>>> it would imply some logic (and a database) on top of it. With the
>>> hardlinking approach, nothing extra would be required; the idea is that
>>> you don't have to do anything else on the server.
>>>
>>> I understand that it would be impossible to do for Windows users
>>> (except under Cygwin), but for *nix ones, yes...
>>> Javier Domingo
>>
>> Paraphrasing from git-clone(1):
>>
>> When cloning a repository, if the source repository is specified with
>> /path/to/repo syntax, the default is to clone the repository by making
>> a copy of HEAD and everything under the objects and refs directories.
>> The files under the .git/objects/ directory are hardlinked to save
>> space when possible. To force copying instead of hardlinking (which may
>> be desirable if you are trying to make a back-up of your repository),
>> --no-hardlinks can be used.
>>
>> So hardlinks should be used where possible, and if they are not, try
>> upgrading Git.
>>
>> I think that covers all the use cases you have?
>
> I am not sure it does. My understanding is this:
>
> 'git clone -l' saves space on the initial clone, but subsequent pushes
> end up with the same objects duplicated across all the "forks"
> (assuming most of the forks keep up with some canonical repo).
>
> The alternates mechanism can give you ongoing savings (as long as you
> push to the "main" repo first), but it is dangerous, in the words of
> the git-clone manpage. You have to be confident no one will delete a
> ref from the "main" repo and then do a gc or let it auto-gc.
>
> He's looking for something that addresses both these issues.
>
> As an additional idea, I suspect this is what the namespaces feature
> was created for, but I am not sure, and have never played with it till
> now.
>
> Maybe someone who knows namespaces very well will chip in...

I dunno about namespaces, but a safe route with alternates seems to be:

Provide one "main" clone which is bare, pulls automatically, and is
there to stay (no pruning), so that all others can use that as a
reliable alternates source.

Michael
* Re: Local clones aka forks disk size optimization

From: Enrico Weigelt @ 2012-11-16 18:04 UTC
To: Michael J Gruber; +Cc: Andrew Ardill, Javier Domingo, git, Sitaram Chamarty

> Provide one "main" clone which is bare, pulls automatically, and is
> there to stay (no pruning), so that all others can use that as a
> reliable alternates source.

The problem here, IMHO, is the assumption that the main repo will never
be cleaned up. But what do you do if you don't want to let it grow
forever?

Hmm, distributed GC is a tricky problem. Maybe it could be made easier
by having two kinds of alternates:

a) classical: gc and friends will drop local objects that are already
   present there
b) fallback: normal operations fetch objects from there if they are not
   accessible from anywhere else, but gc and friends do not skip
   objects that exist there

And extend the prune machinery to put a backup of the dropped objects
into some separate store.

This way we could use some kind of rotating archive:

* GC'ed objects will be stored in the backup repo for a while
* there are multiple active (rotating) backups kept for some time; each
  cycle, only the oldest one is dropped (and maybe objects in a newer
  backup are removed from the older ones)
* downstream repos must be synced often enough, so removed objects are
  fetched back from the backups early enough

You could see this as some kind of heap:

* the currently active objects (directly referenced) are always on top
* once they're not referenced, they sink a level deeper
* when they're referenced again, they immediately jump back to the top
* at some point in time unreferenced objects sink so deep that they're
  dropped completely

cu
--
Mit freundlichen Grüßen / Kind regards

Enrico Weigelt
VNC - Virtual Network Consult GmbH
Head Of Development

Pariser Platz 4a, D-10117 Berlin
Tel.: +49 (30) 3464615-20
Fax: +49 (30) 3464615-59
enrico.weigelt@vnc.biz; www.vnc.de
* Re: Local clones aka forks disk size optimization

From: Sitaram Chamarty @ 2012-11-18 10:42 UTC
To: Enrico Weigelt; +Cc: Michael J Gruber, Andrew Ardill, Javier Domingo, git

On Fri, Nov 16, 2012 at 11:34 PM, Enrico Weigelt <enrico.weigelt@vnc.biz> wrote:
>
>> Provide one "main" clone which is bare, pulls automatically, and is
>> there to stay (no pruning), so that all others can use that as a
>> reliable alternates source.
>
> The problem here, IMHO, is the assumption that the main repo will never
> be cleaned up. But what do you do if you don't want to let it grow
> forever?

That's not the only problem. I believe you only get the savings when
the main repo gets the commits first. Which is probably ok most of the
time, but it's worth mentioning.

>
> Hmm, distributed GC is a tricky problem.

Except for one little issue (see the other thread, subject line
"cloning a namespace downloads all the objects"), namespaces appear to
do everything we want in terms of the typical use cases for alternates
and/or 'git clone -l', at least on the server side.
* Re: Local clones aka forks disk size optimization

From: Enrico Weigelt @ 2012-11-18 17:02 UTC
To: Sitaram Chamarty; +Cc: Michael J Gruber, Andrew Ardill, Javier Domingo, git

Hi,

> That's not the only problem. I believe you only get the savings when
> the main repo gets the commits first. Which is probably ok most of the
> time, but it's worth mentioning.

Well, the saving will just be deferred to the point where the commit
finally reaches the main repo and the downstreams are gc'ed.

>> Hmm, distributed GC is a tricky problem.
>
> Except for one little issue (see the other thread, subject line
> "cloning a namespace downloads all the objects"), namespaces appear to
> do everything we want in terms of the typical use cases for alternates
> and/or 'git clone -l', at least on the server side.

Hmm, I'm not sure about the actual internals, but namespace filtering
should work in such a way that a local clone never sees (or considers)
remote refs that are outside of the requested namespace. Perhaps that
should be handled entirely on the server side, so all the commands
involved treat those refs as nonexistent.

By the way: what happens if one tries to clone from a broken repo (one
which has several refs pointing to nonexistent objects)?

cu
--
Mit freundlichen Grüßen / Kind regards

Enrico Weigelt
VNC - Virtual Network Consult GmbH
Head Of Development

Pariser Platz 4a, D-10117 Berlin
Tel.: +49 (30) 3464615-20
Fax: +49 (30) 3464615-59
enrico.weigelt@vnc.biz; www.vnc.de
* RE: Local clones aka forks disk size optimization

From: Pyeron, Jason J CTR (US) @ 2012-11-16 14:55 UTC
To: git@vger.kernel.org

> -----Original Message-----
> From: Javier Domingo
> Sent: Wednesday, November 14, 2012 8:15 PM
>
> Hi Andrew,
>
> Doing this would require tracking which fork comes from which repo, so
> it would imply some logic (and a database) on top of it. With the
> hardlinking approach, nothing extra would be required; the idea is that
> you don't have to do anything else on the server.
>
> I understand that it would be impossible to do for Windows users

Not true; it is a file system issue, not an OS issue. FAT does not
support hard links, but ext2/3/4 and NTFS do.

> (except under Cygwin), but for *nix ones, yes...
> Javier Domingo
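A quick, hedged way to check whether a particular filesystem supports
hard links at all (the mount point and file names are placeholders;
stat -c is the GNU coreutils form):

    cd /mount/point/to/test
    touch hl-test && ln hl-test hl-test-link   # ln fails on filesystems without hard links
    stat -c %h hl-test                         # prints 2 if the hard link was created
    rm -f hl-test hl-test-link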
* Re: Local clones aka forks disk size optimization

From: Jörg Rosenkranz @ 2012-11-18 17:18 UTC
To: Javier Domingo; +Cc: git

2012/11/15 Javier Domingo <javierdo1@gmail.com>
>
> Is there any way to avoid this? I mean, can something be done in git so
> that it checks (when pulling) for the same objects in the other forks?

I've been using git-new-workdir
(https://github.com/git/git/blob/master/contrib/workdir/git-new-workdir)
for a similar problem. Maybe that's what you're searching for?

Joerg.
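For reference, a usage sketch of that contrib script, assuming it has
been copied somewhere onto PATH and made executable (repository path
and branch name are placeholders):

    # Create an extra working directory that shares the original
    # repository's object store through symlinked .git internals
    git-new-workdir /path/to/project ../project-topic topic-branch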