git.vger.kernel.org archive mirror
* Saving space/network on common repos
From: Craig Silverstein @ 2014-12-17  6:58 UTC
  To: git

At Khan Academy, we are running a Jenkins installation as our build
server.  By design, our Jenkins machine has several different
directories that each hold a copy of the same git repository.  (For
instance, Jenkins may be running tests on our repo at several
different commits at the same time.)  When Jenkins decides to run a
test -- I'm simplifying a bit -- it will pick one of the copies of the
repo, do a 'git fetch origin && git checkout <some commit>' and then
run the tests.

Our repo has a lot of churn and some big files, and this git fetch can
take a long time. I'd like to reduce both the time to fetch and the
disk space used by sharing objects between the repo copies.

My research has turned up three techniques that try to address this use case:
* git clone --reference
* git clone --shared
* git clone <local repo>, which creates hard links
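
For comparison, here is a minimal sketch of the three variants run against
a throwaway repository. Everything under $tmp (the 'upstream' repo and the
three clone directory names) is hypothetical and only there for
illustration. The loop at the end reports which clones ended up with an
objects/info/alternates file.

```shell
#!/bin/sh
# Sketch only: compare the three clone variants on a scratch repository.
set -eu
tmp=$(mktemp -d)
cd "$tmp"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical upstream with a single commit.
git init -q upstream
echo hello > upstream/README
git -C upstream add README
git -C upstream commit -q -m "initial commit"

# 1) --reference: clone from the origin URL, borrowing objects that
#    already exist in a local repository.
git clone -q --reference "$tmp/upstream" "file://$tmp/upstream" with-reference

# 2) --shared: clone a local repository without copying its objects.
git clone -q --shared "$tmp/upstream" with-shared

# 3) Plain local clone: objects are hard-linked (or copied if the
#    filesystem cannot hard-link).
git clone -q "$tmp/upstream" with-hardlinks

# Report which clones borrow objects through an alternates file.
for d in with-reference with-shared with-hardlinks; do
    if [ -f "$d/.git/objects/info/alternates" ]; then
        echo "$d: alternates -> $(cat "$d/.git/objects/info/alternates")"
    else
        echo "$d: no alternates file"
    fi
done
```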

I can probably use any of these approaches, but git clone --reference
would be the easiest to set up.  I would do so by creating a 'cache'
repo that is just created to serve as a reference and not used in any
other way, so I wouldn't have to worry about the dangers with pruning,
accidentally deleting the repo, etc.
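
A rough sketch of that cache-repo layout, assuming the cache is a bare
mirror that gets refreshed before each build directory fetches. The names
'cache.git' and 'build1' are made up for the example, and the final
count-objects call is just one way to see how much the build directory
stored on its own.

```shell
#!/bin/sh
# Sketch only: a bare 'cache' mirror used purely as a --reference source.
set -eu
tmp=$(mktemp -d)
cd "$tmp"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical upstream repository.
git init -q upstream
echo one > upstream/file.txt
git -C upstream add file.txt
git -C upstream commit -q -m "first"

# The cache: a bare mirror that nobody works in.
git clone -q --mirror "file://$tmp/upstream" cache.git

# A build directory that borrows objects from the cache.
git clone -q --reference "$tmp/cache.git" "file://$tmp/upstream" build1

# Upstream moves on.
echo two >> upstream/file.txt
git -C upstream commit -q -a -m "second"
new=$(git -C upstream rev-parse HEAD)

# Refresh the cache first, then fetch and check out in the build directory.
git -C cache.git fetch -q
git -C build1 fetch -q origin
git -C build1 checkout -q "$new"

# Show how many objects the build directory holds in its own store.
git -C build1 count-objects -v
```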

My big concern is that all these methods seem to just affect clone.  So:

Question 1) If I do 'git clone --reference', will the reference repo be
used for subsequent fetches as well?  What about 'git clone --shared'?

Question 2) If I git clone a local repo, will subsequent fetches also
create hard links?

Question 3) If the answer to any of the above is yes, how does this
work with packing?  Say I pack the reference repo (being careful not
to prune anything).  Will subsequent fetches still be able to get the
objects they need from the reference repo?
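
One way to test the packing question empirically is sketched below, using
hypothetical 'cache' and 'borrower' repos. The borrower is created with
--shared, so it has no objects of its own; the checks at the end can only
pass if lookups through the alternates file still succeed once the cache
holds a pack instead of loose objects.

```shell
#!/bin/sh
# Sketch only: repack the reference repo, then read through alternates.
set -eu
tmp=$(mktemp -d)
cd "$tmp"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical reference repo with one commit.
git init -q cache
echo data > cache/file.txt
git -C cache add file.txt
git -C cache commit -q -m "first"

# The borrower stores nothing itself; it points at the cache's objects.
git clone -q --shared "$tmp/cache" borrower

# Pack everything reachable into one pack and drop the redundant loose
# copies. (With -d, -A instead of -a keeps unreachable objects from old
# packs around as loose objects rather than dropping them.)
git -C cache repack -a -d -q

# Both commands need the commit and tree, which now live only in the pack.
git -C borrower fsck
git -C borrower cat-file -e 'HEAD^{tree}'
```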

An added complication is submodules.  We have a submodule that is as
big and slow to fetch as our main repository.

Question 4) Is there a practical way to set up submodules so they can
use the same object-sharing framework that the main repo does?

I'm not keen on rewriting .gitmodules in each of my repos, so probably
something that uses info/alternates is the most workable.  I have a
scheme for setting that up that might work, but it's a moot point if
info/alternates doesn't work for fetching.
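
A sketch of wiring up info/alternates by hand on an existing clone, then
fetching, as one way to try this out. The repo names are hypothetical. For
a submodule, the equivalent file would presumably sit under
.git/modules/<submodule>/objects/info/alternates; that part is an
assumption and is not exercised here.

```shell
#!/bin/sh
# Sketch only: add an alternates entry to an existing clone, then fetch.
set -eu
tmp=$(mktemp -d)
cd "$tmp"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical upstream, cache mirror, and an ordinary independent clone.
git init -q upstream
echo one > upstream/file.txt
git -C upstream add file.txt
git -C upstream commit -q -m "first"
git clone -q --mirror "file://$tmp/upstream" cache.git
git clone -q "file://$tmp/upstream" existing

# Point the existing clone at the cache's object store.
# The file takes one object-directory path per line.
mkdir -p existing/.git/objects/info
echo "$tmp/cache.git/objects" >> existing/.git/objects/info/alternates

# Upstream gains a commit; refresh the cache, then fetch in the clone.
echo two >> upstream/file.txt
git -C upstream commit -q -a -m "second"
new=$(git -C upstream rev-parse HEAD)
git -C cache.git fetch -q
git -C existing fetch -q origin

# Confirm the clone can read the new commit after the fetch.
git -C existing cat-file -t "$new"
```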

I'm wondering if the best approach for us might be to use
GIT_OBJECT_DIRECTORY: set GIT_OBJECT_DIRECTORY to the shared cache
directory for each of our repos, so they all fetch to the same place.
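
The mechanics of that idea might look like the sketch below. It only shows
how the variable is used, not whether the setup is safe or supported. The
'shared.git' store and the 'build1'/'build2' directories are hypothetical.
Note that every later git command in those repos needs the same variable
set, since their own object stores stay empty.

```shell
#!/bin/sh
# Sketch only: two repos fetching into one object directory.
set -eu
tmp=$(mktemp -d)
cd "$tmp"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical upstream with one commit.
git init -q upstream
echo one > upstream/file.txt
git -C upstream add file.txt
git -C upstream commit -q -m "first"
new=$(git -C upstream rev-parse HEAD)

# A bare repo used only for its objects/ directory.
git init -q --bare shared.git
shared="$tmp/shared.git/objects"

for name in build1 build2; do
    git init -q "$name"
    git -C "$name" remote add origin "file://$tmp/upstream"
    # Fetched objects land in the shared directory, not in $name/.git/objects.
    GIT_OBJECT_DIRECTORY="$shared" git -C "$name" fetch -q origin
done

# With the variable set, the commit is visible from either repo.
GIT_OBJECT_DIRECTORY="$shared" git -C build1 cat-file -t "$new"

# Without it, the repo's own (empty) object store is consulted instead.
if ! git -C build1 cat-file -e "$new" 2>/dev/null; then
    echo "build1 cannot see $new without GIT_OBJECT_DIRECTORY"
fi
```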

Question 5) I haven't seen this mentioned anywhere else, so I'm
guessing it won't work.  Am I missing a big problem?

Question 6) Will git be sad if two different repos that share an
object directory both run 'git fetch' at the same time?  I could maybe
protect fetches with an flock, but Jenkins can do git operations
behind my back, so it would be easier if I didn't have to worry about
locking.
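
The flock approach could be sketched as below, assuming flock(1) from
util-linux is installed. The lock file path and the locked_fetch helper are
hypothetical. It only serializes fetches that go through the helper; git
commands that Jenkins runs on its own would bypass the lock.

```shell
#!/bin/sh
# Sketch only: serialize fetches across build directories with flock(1).
set -eu
tmp=$(mktemp -d)
cd "$tmp"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical upstream and two build directories.
git init -q upstream
echo one > upstream/file.txt
git -C upstream add file.txt
git -C upstream commit -q -m "first"
git clone -q "file://$tmp/upstream" build1
git clone -q "file://$tmp/upstream" build2

# Upstream gains a commit for the builds to fetch.
echo two >> upstream/file.txt
git -C upstream commit -q -a -m "second"
new=$(git -C upstream rev-parse HEAD)

lock="$tmp/fetch.lock"   # hypothetical lock file shared by all builds
: > "$lock"

locked_fetch() {
    # Hold an exclusive lock on $lock while the fetch runs, so only one
    # fetch started through this helper is active at a time.
    flock "$lock" git -C "$1" fetch -q origin
}

# Start both fetches concurrently; the lock makes them run one by one.
locked_fetch build1 &
pid1=$!
locked_fetch build2 &
pid2=$!
wait "$pid1"
wait "$pid2"
```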

Question 7) Is GIT_OBJECT_DIRECTORY supposed to work with submodules?
In my experimentation, it looks like it doesn't: when I run
'GIT_OBJECT_DIRECTORY=../obj git submodule update --init' it still
puts the objects in .git/modules/<submodule>/objects/.  Is this a bug?
Is there any way to work around it?

Any suggestions would be appreciated!  It feels to me like this is
something that git should support pretty easily given its
architecture, but I just don't see a way to do it.

Thanks,
craig


