* Dealing with many many git repos in a /home directory
@ 2010-02-04 8:29 demerphq
2010-02-04 9:57 ` Alex Riesen
` (3 more replies)
0 siblings, 4 replies; 6+ messages in thread
From: demerphq @ 2010-02-04 8:29 UTC (permalink / raw)
To: Git
At $work we have a host where we have about 50-100 users, each with
their own private copies of the same repos. These are cloned from a
remote via git/ssh and thus are not automatically hardlinking their
object stores.
This is starting to take a lot of space.
I was thinking it should be possible to hardlink all of the objects in
the different repos to a canonical single copy.
Would I be correct in thinking that if I have two repos with an
equivalent .git/objects/../..... file in them, the files are
necessarily identical and one can be replaced by a hardlink to the
other?
If this is correct then is there some tool known to the list that
already does this? I whipped this together:
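# link() fails if the target exists, so link to a temp name and rename
find /home -type f -regex '.*/\.git/objects/../.*' | perl -lne 'if
(m!(\.git/objects/../.+)!) { if (my $t = $seen{$1}) { link($t, "$_.tmp")
and rename("$_.tmp", $_) } else { $seen{$1} = $_ } }'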
But a proper script with a sign off of some git dev would make me feel
a lot safer :-)
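For illustration, a more defensive version of the same idea could be
sketched in Python as below. This is only a sketch under assumptions, not
something reviewed by a git developer: it assumes SHA-1 object names (2+38
hex digits), that the script may write into every user's .git directory,
and the function names are made up for the example. It compares the
decompressed contents before linking, since two loose copies of one object
can be compressed differently.

```python
import os
import re
import zlib

# Loose objects only: .git/objects/<2 hex>/<38 hex> (SHA-1 names assumed).
LOOSE = re.compile(r"\.git/objects/([0-9a-f]{2})/([0-9a-f]{38})$")


def same_object(a, b):
    """Compare decompressed contents; the compressed bytes may differ."""
    with open(a, "rb") as fa, open(b, "rb") as fb:
        return zlib.decompress(fa.read()) == zlib.decompress(fb.read())


def hardlink_loose_objects(root):
    """Replace duplicate loose objects under root with hardlinks.

    Returns the number of files that were replaced.
    """
    seen = {}  # object name -> first path found for it
    replaced = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            m = LOOSE.search(path.replace(os.sep, "/"))
            if not m:
                continue
            first = seen.setdefault(m.group(1) + m.group(2), path)
            if first == path or os.path.samefile(first, path):
                continue  # first sighting, or already hardlinked
            if os.stat(first).st_dev != os.stat(path).st_dev:
                continue  # hardlinks cannot cross filesystems
            if not same_object(first, path):
                continue  # unexpected mismatch: leave both files alone
            tmp = path + ".tmp-link"
            os.link(first, tmp)    # link() will not overwrite a file
            os.replace(tmp, path)  # atomic rename over the duplicate
            replaced += 1
    return replaced
```

Note that a hardlinked file has a single owner, so with per-user quotas
or strict permissions this may need more thought.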
cheers,
Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"
* Re: Dealing with many many git repos in a /home directory
From: Alex Riesen @ 2010-02-04 9:57 UTC (permalink / raw)
To: demerphq; +Cc: Git
On Thu, Feb 4, 2010 at 09:29, demerphq <demerphq@gmail.com> wrote:
> Would I be correct in thinking that if I have two repos with an
> equivalent .git/objects/../..... file in them, the files are
> necessarily identical and one can be replaced by a hardlink to the
> other?
Yes, but you probably won't save as much as you'd like: think about the users
who *do* repack their repositories. The .pack files will all be different.
* Re: Dealing with many many git repos in a /home directory
From: Sergio @ 2010-02-04 15:20 UTC (permalink / raw)
To: git
Alex Riesen <raa.lkml <at> gmail.com> writes:
>
> On Thu, Feb 4, 2010 at 09:29, demerphq <demerphq <at> gmail.com> wrote:
> > Would I be correct in thinking that if I have two repos with an
> > equivalent .git/objects/../..... file in them, the files are
> > necessarily identical and one can be replaced by a hardlink to the
> > other?
>
> Yes, but you probably won't save as much as you'd like: think about the
> users who *do* repack their repositories. The .pack files will all be
> different.
>
Maybe you can, for each repo:
- clone it to some place
- pack it with git gc --aggressive
- take the resulting pack and move it (and the associated index) somewhere
- create in the same place a file with the same name (hash) as the pack
  and the extension .keep and possibly put inside it some note about its
  content (e.g. what repo was cloned and at what state/time it was
  frozen)
- ask the users to go into the .git/objects/pack dir of their private
  copy of the corresponding repo and hardlink there the .pack, .idx and
  .keep files that you have prepared
- ask the users to invoke git gc
Before actually doing that on something important, maybe wait for
confirmation from some developer that there is nothing flawed in the
approach.
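The hardlinking step could be scripted along these lines. This is a
minimal sketch in Python, assuming the prepared pack-<hash>.pack, .idx and
.keep files already sit in one directory on the same filesystem as the
users' repos; link_kept_pack and its parameters are names invented for
this example.

```python
import os


def link_kept_pack(prepared_dir, pack_name, repo_git_dir):
    """Hardlink <pack_name>.keep/.pack/.idx into a repo's objects/pack dir.

    pack_name is the base name without extension, e.g. "pack-<hash>".
    Returns the list of paths that were created.
    """
    dest_dir = os.path.join(repo_git_dir, "objects", "pack")
    os.makedirs(dest_dir, exist_ok=True)
    created = []
    # .keep goes in first and .idx last, so the pack is already marked
    # as kept by the time its index appears next to it.
    for ext in (".keep", ".pack", ".idx"):
        src = os.path.join(prepared_dir, pack_name + ext)
        dst = os.path.join(dest_dir, pack_name + ext)
        if not os.path.exists(dst):
            os.link(src, dst)  # raises OSError across filesystems
            created.append(dst)
    return created
```

Each user would then still run git gc in their own repo, as in the last
step above.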
Personally, I tend to use keep files a lot because I need to keep two
machines synchronized using "unison". Without keep files, large packs are
changed at every gc and the synchronization takes ages. By "freezing" a
stable subset of my objects I keep the changing packs much smaller and
reduce the amount of data that needs to be carried over by unison to keep
the two machines in sync.
* Re: Dealing with many many git repos in a /home directory
From: Martin Langhoff @ 2010-02-04 15:00 UTC (permalink / raw)
To: demerphq; +Cc: Git
On Thu, Feb 4, 2010 at 3:29 AM, demerphq <demerphq@gmail.com> wrote:
> This is starting to take a lot of space.
What I used to do was to
- have a "canonical" local bare repo for each major project, fetching
and repacking nightly
- a script that "injects" an "alternates" entry to matching user
repos -- logic to look at a repo and decide which alternate to hook it
to is left to the reader.
- optional: automating repacks on users' repos
As users repack, their "local" packs will only have the objects that
are not shared with the canonical repos. With Moodle repos, this was a
200MB savings per repo.
And the kernel keeps one set of packfiles in buffers, so everyone gets
much faster gitk / git log / blame...
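One possible shape for the matching logic mentioned above, sketched in
Python: read the origin URL out of the repo's .git/config and look it up
in a table of canonical repos. The parser below is deliberately simplistic
(asking 'git config --get remote.origin.url' would be more robust), and
the function names, the URL and the path in the usage example are
hypothetical.

```python
def origin_url(config_text):
    """Return the url of [remote "origin"] from the text of a .git/config."""
    in_origin = False
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # A new section starts; remember whether it is the origin remote.
            in_origin = stripped == '[remote "origin"]'
        elif in_origin:
            key, sep, value = stripped.partition("=")
            if sep and key.strip().lower() == "url":
                return value.strip()
    return None


def pick_alternate(config_text, canonical_by_url):
    """Map a repo's origin URL to a canonical objects dir, or None."""
    return canonical_by_url.get(origin_url(config_text))
```

For example, with a table such as
{"git@example.com:moodle.git": "/srv/git/moodle.git/objects"}, a user
repo cloned from that URL would be hooked to that objects directory, and
any repo with an unknown origin would be left alone.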
m
--
martin.langhoff@gmail.com
martin@laptop.org -- School Server Architect
- ask interesting questions
- don't get distracted with shiny stuff - working code first
- http://wiki.laptop.org/go/User:Martinlanghoff
* Re: Dealing with many many git repos in a /home directory
From: Andreas Schwab @ 2010-02-04 15:32 UTC (permalink / raw)
To: demerphq; +Cc: Git
demerphq <demerphq@gmail.com> writes:
> At $work we have a host where we have about 50-100 users, each with
> their own private copies of the same repos. These are cloned from a
> remote via git/ssh and thus are not automatically hardlinking their
> object stores.
>
> This is starting to take a lot of space.
Create local mirrors of the remote repos (and update them regularly)
and ask the users to borrow from them.
Andreas.
--
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."
* Re: Dealing with many many git repos in a /home directory
From: Nicolas Pitre @ 2010-02-04 17:35 UTC (permalink / raw)
To: demerphq; +Cc: Git
On Thu, 4 Feb 2010, demerphq wrote:
> At $work we have a host where we have about 50-100 users, each with
> their own private copies of the same repos. These are cloned from a
> remote via git/ssh and thus are not automatically hardlinking their
> object stores.
>
> This is starting to take a lot of space.
You should keep a pristine copy of that common repository on that host
and make it readable to everyone, and then ask your users to use the
--reference argument with 'git clone' to borrow as much as possible from
that common repository.
For those who already cloned the repository in full, i.e. without the
--reference switch, it is possible to fix the situation simply by
adding the full path to the common repository's .git/objects directory
to their own .git/objects/info/alternates (create it if it doesn't
exist) and then running 'git gc'. That's what the --reference argument
to the clone command does: it sets up that .git/objects/info/alternates
file.
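As a rough sketch, the alternates edit for one repository could look
like this in Python. The helper name add_alternate and its arguments are
made up for the example; it only edits the file, and 'git gc' still has
to be run in the repository afterwards.

```python
import os


def add_alternate(repo_git_dir, canonical_objects_dir):
    """Add canonical_objects_dir to objects/info/alternates if missing.

    Returns True if the alternates file was changed, False otherwise.
    """
    info_dir = os.path.join(repo_git_dir, "objects", "info")
    os.makedirs(info_dir, exist_ok=True)
    alternates = os.path.join(info_dir, "alternates")
    target = os.path.abspath(canonical_objects_dir)  # full path, one per line

    lines = []
    if os.path.exists(alternates):
        with open(alternates) as f:
            lines = [line.strip() for line in f if line.strip()]
    if target in lines:
        return False
    with open(alternates, "w") as f:
        f.write("\n".join(lines + [target]) + "\n")
    return True
```

A call such as add_alternate("/home/alice/project/.git",
"/srv/git/project/.git/objects") uses hypothetical paths. Once
repositories borrow objects this way, the common repository must not
lose any object they rely on.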
> I was thinking it should be possible to hardlink all of the objects in
> the different repos to a canonical single copy.
>
> Would I be correct in thinking that if I have two repos with an
> equivalent .git/objects/../..... file in them, the files are
> necessarily identical and one can be replaced by a hardlink to the
> other?
Yes, you could do that. However, you'll save very little by doing so,
as the bulk of a repository's content is normally stored in pack files,
and those may differ from one repository to another depending on what
exactly each pack contains. The alternates mechanism is more powerful:
it lets Git fetch objects from the canonical repository whether they
are packed or not and, more importantly, it avoids creating a local
copy of new objects if they already exist in the canonical copy. That
means you don't have to constantly search every user's repository for
potential new objects to hardlink.
> If this is correct then is there some tool known to the list that
> already does this? I whipped this together:
The "tool" exists in Git already and is what I describe above. The
actual tool you might need is probably a script to populate that
.git/objects/info/alternates file in all your users' repositories and
maybe run 'git gc' on their behalf.
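Such a script might look roughly like the following Python sketch. It
assumes that every repository found under the given root is a clone of
the same project as the common repository, which is a simplification;
the function names and the run_gc switch are invented here. 'git gc' is
run through subprocess only when asked, and would normally have to run
as the user who owns the repository.

```python
import os
import subprocess


def find_git_dirs(root):
    """Yield each .git directory under root that contains an objects dir."""
    for dirpath, dirnames, _files in os.walk(root):
        if ".git" in dirnames:
            dirnames.remove(".git")  # do not descend into the repository
            git_dir = os.path.join(dirpath, ".git")
            if os.path.isdir(os.path.join(git_dir, "objects")):
                yield git_dir


def populate_alternates(root, canonical_objects_dir, run_gc=False):
    """Add canonical_objects_dir to the alternates of every repo under root.

    Returns the list of .git directories whose alternates file was changed.
    """
    target = os.path.abspath(canonical_objects_dir)
    changed = []
    for git_dir in find_git_dirs(root):
        objects = os.path.join(git_dir, "objects")
        if os.path.exists(target) and os.path.samefile(objects, target):
            continue  # never make the common repo borrow from itself
        info_dir = os.path.join(objects, "info")
        os.makedirs(info_dir, exist_ok=True)
        alternates = os.path.join(info_dir, "alternates")
        lines = []
        if os.path.exists(alternates):
            with open(alternates) as f:
                lines = [line.strip() for line in f if line.strip()]
        if target in lines:
            continue  # already set up
        with open(alternates, "w") as f:
            f.write("\n".join(lines + [target]) + "\n")
        changed.append(git_dir)
        if run_gc:
            # Assumption: git is on PATH and this process may write here.
            subprocess.run(["git", "gc"], cwd=os.path.dirname(git_dir),
                           check=True)
    return changed
```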
Nicolas