* Dealing with many many git repos in a /home directory
@ 2010-02-04  8:29 demerphq
  2010-02-04  9:57 ` Alex Riesen
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: demerphq @ 2010-02-04  8:29 UTC (permalink / raw)
  To: Git

At $work we have a host where we have about 50-100 users each with
their own private copies of the same repos. These are cloned from a
remote via git/ssh and thus do not automatically hardlink their
object stores.

This is starting to take a lot of space.

I was thinking it should be possible to hardlink all of the objects in
the different repos to a canonical single copy.

Would I be correct in thinking that if I have two repos with an
equivalent .git/objects/../..... file in them, the files are
necessarily identical and one can be replaced by a hardlink to the
other?

If this is correct then is there some tool known to the list that
already does this?  I whipped this together:

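find /home -type f -regex '.*/\.git/objects/.*' | perl -lne '
  next unless m!(\.git/objects/[0-9a-f]{2}/[0-9a-f]+)$!;
  my $t = $seen{$1} or do { $seen{$1} = $_; next };
  # link() will not overwrite an existing file: link aside, then rename
  if (link $t, "$_.tmp") { rename "$_.tmp", $_; unlink "$_.tmp" }'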

But a proper script with a sign off of some git dev would make me feel
a lot safer :-)

cheers,
Yves


-- 
perl -Mre=debug -e "/just|another|perl|hacker/"


* Re: Dealing with many many git repos in a /home directory
  2010-02-04  8:29 Dealing with many many git repos in a /home directory demerphq
@ 2010-02-04  9:57 ` Alex Riesen
  2010-02-04 15:20   ` Sergio
  2010-02-04 15:00 ` Martin Langhoff
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 6+ messages in thread
From: Alex Riesen @ 2010-02-04  9:57 UTC (permalink / raw)
  To: demerphq; +Cc: Git

On Thu, Feb 4, 2010 at 09:29, demerphq <demerphq@gmail.com> wrote:
> Would I be correct in thinking that if I have two repos with an
> equivalent .git/objects/../..... file in them, the files are
> necessarily identical and one can be replaced by a hardlink to the
> other?

Yes, but you probably won't save as much as you'd like: think about the users
who *do* repack their repositories. The .pack files will all be different.
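
To gauge how much hardlinking loose objects could actually save, it may help to look at how a repository's objects are split between loose files and packs. A minimal sketch follows; the function names are made up for illustration, and `git count-objects -v` reports similar figures from Git itself.

```shell
#!/bin/sh
# Count loose object files: objects/<2 hex digits>/<rest of the hash>.
count_loose_objects() {
    find "$1/.git/objects" -type f -path '*/objects/[0-9a-f][0-9a-f]/*' |
        wc -l | tr -d ' '
}

# Count pack files; these generally differ between repos once users repack.
count_packs() {
    find "$1/.git/objects/pack" -type f -name 'pack-*.pack' 2>/dev/null |
        wc -l | tr -d ' '
}

# Hypothetical usage:
#   echo "loose: $(count_loose_objects "$HOME/project")"
#   echo "packs: $(count_packs "$HOME/project")"
```

A repository with few loose objects and large packs is one where hardlinking loose files alone would recover little space.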


* Re: Dealing with many many git repos in a /home directory
  2010-02-04  8:29 Dealing with many many git repos in a /home directory demerphq
  2010-02-04  9:57 ` Alex Riesen
@ 2010-02-04 15:00 ` Martin Langhoff
  2010-02-04 15:32 ` Andreas Schwab
  2010-02-04 17:35 ` Nicolas Pitre
  3 siblings, 0 replies; 6+ messages in thread
From: Martin Langhoff @ 2010-02-04 15:00 UTC (permalink / raw)
  To: demerphq; +Cc: Git

On Thu, Feb 4, 2010 at 3:29 AM, demerphq <demerphq@gmail.com> wrote:
> This is starting to take a lot of space.

What I used to do was to

 - have a "canonical" local bare repo for each major project, fetching
and repacking nightly

 - a script that "injects" an "alternates" entry to matching user
repos -- logic to look at a repo and decide which alternate to hook it
to is left to the reader.

 - optional: automating repacks on users' repos

As users repack, their "local" packs will only have the objects that
are not shared with the canonical repos. With Moodle repos, this was a
200MB savings per repo.

And the kernel keeps one set of packfiles in buffers, so everyone gets
much faster gitk / git log / blame...
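
The "which alternate to hook it to" step could, for example, key off the name of the upstream repository. A rough sketch, assuming the canonical bare repos live under a hypothetical /srv/git/canonical and are named after the last path component of the clone URL (both the path and the function names are assumptions, not anything from the setup described above):

```shell
#!/bin/sh
# Assumed location of the canonical bare repos (hypothetical path).
CANON_ROOT=${CANON_ROOT:-/srv/git/canonical}

# Map a clone URL to the canonical repo expected to mirror it,
# e.g. git@host:team/moodle.git -> $CANON_ROOT/moodle.git
canonical_for_url() {
    name=${1%/}          # drop a trailing slash
    name=${name##*/}     # keep the last path component
    name=${name##*:}     # handle scp-style host:repo.git with no slash
    name=${name%.git}
    printf '%s/%s.git\n' "$CANON_ROOT" "$name"
}

# Print the canonical repo for a user's clone, if one exists on disk.
canonical_for_repo() {
    url=$(git --git-dir="$1/.git" config --get remote.origin.url) || return 1
    canon=$(canonical_for_url "$url")
    [ -d "$canon/objects" ] && printf '%s\n' "$canon"
}
```

Matching by name alone is a simplification: two unrelated projects with the same name would be paired up, so a real script might compare the full URL instead.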



m
-- 
 martin.langhoff@gmail.com
 martin@laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff


* Re: Dealing with many many git repos in a /home directory
  2010-02-04  9:57 ` Alex Riesen
@ 2010-02-04 15:20   ` Sergio
  0 siblings, 0 replies; 6+ messages in thread
From: Sergio @ 2010-02-04 15:20 UTC (permalink / raw)
  To: git

Alex Riesen <raa.lkml <at> gmail.com> writes:

> 
> On Thu, Feb 4, 2010 at 09:29, demerphq <demerphq <at> gmail.com> wrote:
> > Would I be correct in thinking that if I have two repos with an
> > equivalent .git/objects/../..... file in them, the files are
> > necessarily identical and one can be replaced by a hardlink to the
> > other?
> 
> Yes, but you probably won't save as much as you'd like: think about the
> users who *do* repack their repositories. The .pack files will all be
> different.
> 


Maybe you can:

for each repo
  clone it to some place
  pack it with gc --aggressive
  take the resulting pack and move it (and the associated index) somewhere
  in the same place, create a file with the same name as the pack but with
    the extension .keep, and possibly put a note inside about its content
    (e.g. which repo was cloned and at what state/time it was frozen)
  ask the users to go to the .git/objects/pack dir of their private copy
    of the corresponding repo and hardlink there the .pack, .idx, .keep
    files that you have prepared
  ask the users to invoke git gc

Before actually doing that on something important, maybe wait for
confirmation from some developer that there is nothing flawed in the
approach.

Personally, I tend to use keep files a lot because I need to keep two
machines synchronized using "unison". Without keep files, large packs are
changed at every gc and the synchronization takes ages. By "freezing" a
stable subset of my objects I keep the changing packs much smaller and
reduce the amount of data that needs to be carried over by unison to keep
the two machines in sync.
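
The procedure above might be scripted roughly as follows. This is only a sketch: the function names and the shared directory are hypothetical, and it assumes the shared directory and the users' repos sit on the same filesystem so that hardlinks work.

```shell
#!/bin/sh
# Build one aggressively packed, "kept" pack from a repo (run once by an admin).
# $1 = URL or path to clone from, $2 = directory that will hold the frozen pack.
freeze_pack() {
    src=$1 shared=$2
    tmp=$(mktemp -d) || return 1
    git clone --bare "$src" "$tmp/frozen.git" &&
        git --git-dir="$tmp/frozen.git" gc --aggressive || return 1
    mkdir -p "$shared"
    for pack in "$tmp/frozen.git/objects/pack"/pack-*.pack; do
        base=${pack%.pack}
        mv "$base.pack" "$base.idx" "$shared/" || return 1
        # The .keep file tells gc/repack to leave this pack alone.
        echo "frozen from $src on $(date)" > "$shared/$(basename "$base").keep"
    done
    rm -rf "$tmp"
}

# Hardlink the frozen pack into one user's repo (run by each user).
# $1 = directory holding the frozen pack, $2 = the user's working repo.
link_frozen_pack() {
    shared=$1 pack_dir=$2/.git/objects/pack
    mkdir -p "$pack_dir"
    for keep in "$shared"/pack-*.keep; do
        [ -e "$keep" ] || continue
        base=${keep%.keep}
        # Link .keep first so a concurrent gc never sees the pack without it.
        ln -f "$base.keep" "$base.pack" "$base.idx" "$pack_dir/" || return 1
    done
}
```

After `link_frozen_pack`, each user would run `git gc` in their own repo so that objects already present in the kept pack are dropped from their other packs.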


* Re: Dealing with many many git repos in a /home directory
  2010-02-04  8:29 Dealing with many many git repos in a /home directory demerphq
  2010-02-04  9:57 ` Alex Riesen
  2010-02-04 15:00 ` Martin Langhoff
@ 2010-02-04 15:32 ` Andreas Schwab
  2010-02-04 17:35 ` Nicolas Pitre
  3 siblings, 0 replies; 6+ messages in thread
From: Andreas Schwab @ 2010-02-04 15:32 UTC (permalink / raw)
  To: demerphq; +Cc: Git

demerphq <demerphq@gmail.com> writes:

> At $work we have a host where we have about 50-100 users each with
> their own private copies of the same repos. These are cloned from a
> remote via git/ssh and thus do not automatically hardlink their
> object stores.
>
> This is starting to take a lot of space.

Create local mirrors of the remote repos (and update them regularly)
and ask the users to borrow from them.
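
A minimal sketch of such a mirror, assuming a hypothetical /srv/git/mirrors directory that all users can read; the function name, script path and URL are made up, and the update would typically be run from cron.

```shell
#!/bin/sh
# Create the mirror on first use, then refresh it on later runs.
# $1 = remote URL, $2 = local mirror path (both supplied by the admin).
update_mirror() {
    url=$1 mirror=$2
    if [ -d "$mirror/objects" ]; then
        git --git-dir="$mirror" remote update --prune
    else
        git clone --mirror "$url" "$mirror"
    fi
}

# Hypothetical nightly cron entry calling a wrapper script around update_mirror:
#   0 3 * * * /usr/local/bin/update-mirror ssh://git.example.com/project.git /srv/git/mirrors/project.git
```

Users can then borrow from the mirror when cloning, for instance with the --reference option of 'git clone'.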

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


* Re: Dealing with many many git repos in a /home directory
  2010-02-04  8:29 Dealing with many many git repos in a /home directory demerphq
                   ` (2 preceding siblings ...)
  2010-02-04 15:32 ` Andreas Schwab
@ 2010-02-04 17:35 ` Nicolas Pitre
  3 siblings, 0 replies; 6+ messages in thread
From: Nicolas Pitre @ 2010-02-04 17:35 UTC (permalink / raw)
  To: demerphq; +Cc: Git

On Thu, 4 Feb 2010, demerphq wrote:

> At $work we have a host where we have about 50-100 users each with
> their own private copies of the same repos. These are cloned from a
> remote via git/ssh and thus do not automatically hardlink their
> object stores.
> 
> This is starting to take a lot of space.

You should keep a pristine copy of that common repository on that host 
and make it readable to everyone, and then ask your users to use the 
--reference argument with 'git clone' to borrow as much as possible from 
that common repository.

For those who already cloned the repository in full i.e. without the 
--reference switch, then it is possible to fix the situation simply by 
adding the full path to the common repository's .git/objects directory 
in their own .git/objects/info/alternates (create it if it doesn't 
exist) and then run 'git gc'.  That's what the --reference argument to 
the clone command does: setting up that .git/objects/info/alternates 
file.
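
For an existing clone, the manual fix described above could look roughly like this. It is a sketch only: add_alternate is a made-up helper name, and the paths and URL in the usage lines are placeholders.

```shell
#!/bin/sh
# Append the canonical repo's objects directory to a clone's alternates
# file, creating the file if needed and skipping duplicates.
# $1 = user's repo (work tree), $2 = canonical repo's objects directory.
add_alternate() {
    alt_file=$1/.git/objects/info/alternates
    [ -d "$1/.git/objects" ] || { echo "not a git repo: $1" >&2; return 1; }
    [ -d "$2" ] || { echo "no such objects dir: $2" >&2; return 1; }
    mkdir -p "$1/.git/objects/info"
    grep -qxF -- "$2" "$alt_file" 2>/dev/null || printf '%s\n' "$2" >> "$alt_file"
}

# Hypothetical usage for an existing clone:
#   add_alternate ~/project /srv/git/project.git/objects && git -C ~/project gc
# For new clones, let Git set this up itself:
#   git clone --reference /srv/git/project.git ssh://git.example.com/project.git
```

One caveat worth keeping in mind: a repository that borrows objects this way depends on them staying in the common repository, so objects should not be pruned from the common repository while clones still reference them.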

> I was thinking it should be possible to hardlink all of the objects in
> the different repos to a canonical single copy.
> 
> Would I be correct in thinking that if I have two repos with an
> equivalent .git/objects/../..... file in them, the files are
> necessarily identical and one can be replaced by a hardlink to the
> other?

Yes, you could do that.  However, you'll save very little by doing so,
as the bulk of a repository's content is normally stored in pack files,
and those may differ from one repository to another depending on what
exactly each pack contains.  The alternates mechanism is more powerful as
it lets Git fetch objects from the canonical repository, packed or not,
and more importantly it avoids creating local copies of new objects if
they already exist in that canonical copy, meaning that you don't have
to constantly search every user's repository for potential new
objects to hardlink.

> If this is correct then is there some tool known to the list that
> already does this?  I whipped this together:

The "tool" exists in Git already and is what I describe above.  The 
actual tool you might need is probably a script to populate that 
.git/objects/info/alternates file in all your users' repositories and
maybe run 'git gc' on their behalf.
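
Such a script might look roughly like the sketch below. The function names are hypothetical, and it assumes that every repository found under the given root is a clone of the same project and that the canonical repository itself lives outside that root.

```shell
#!/bin/sh
# List repositories (directories containing .git) below a root directory.
find_repos() {
    find "$1" -type d -name .git -prune 2>/dev/null | sed 's|/\.git$||'
}

# Point every repo under $1 at the canonical objects directory $2.
# With a third argument "gc", also run 'git gc' in each repo afterwards.
populate_alternates() {
    find_repos "$1" | while IFS= read -r repo; do
        alt_file=$repo/.git/objects/info/alternates
        mkdir -p "$repo/.git/objects/info"
        # Append the path only if it is not already listed.
        grep -qxF -- "$2" "$alt_file" 2>/dev/null || printf '%s\n' "$2" >> "$alt_file"
        if [ "${3:-}" = gc ]; then
            git -C "$repo" gc --quiet || echo "gc failed in $repo" >&2
        fi
    done
}

# Hypothetical usage:
#   populate_alternates /home /srv/git/project.git/objects gc
```

If this is run as root, the alternates file and any repacked files end up owned by root, so it is probably safer to run the per-repo part as the owning user (for example through sudo -u).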


Nicolas

