* GIT as binary repository
@ 2010-10-21 12:52 Wilson, Kevin Lee (OpenView Engineer)
2010-10-21 14:19 ` Tay Ray Chuan
` (2 more replies)
0 siblings, 3 replies; 9+ messages in thread
From: Wilson, Kevin Lee (OpenView Engineer) @ 2010-10-21 12:52 UTC (permalink / raw)
To: git@vger.kernel.org
Hello,
We are investigating the use of GIT as a binary repository solution. Our larger files are near 800MB and the total checked-out repo size is about 3GB. The repo size in SVN is more like 20-30GB; if we could prune the history prior to MR, we could get these sizes down considerably. This binary repo is really for our super project build. From what I have read and learned, this is not a good fit for the GIT tool. Have there been performance improvements lately? Some of the posts I have read are quite old.
I also have some questions about how the workflow would operate for getting all of the changes from several different teams merged into the one repository. Do we set up a shared system for engineers to perform the merges on? Our teams are geographically dispersed.
Thanks for any light you can shed on this. We are trying to digest a lot of information quickly, so apologies if some of this is already covered elsewhere.
Thanks,
Kevin
* Re: GIT as binary repository
2010-10-21 12:52 GIT as binary repository Wilson, Kevin Lee (OpenView Engineer)
@ 2010-10-21 14:19 ` Tay Ray Chuan
2010-10-21 19:54 ` Enrico Weigelt
2010-10-21 17:38 ` Shawn Pearce
2010-10-21 19:19 ` Enrico Weigelt
2 siblings, 1 reply; 9+ messages in thread
From: Tay Ray Chuan @ 2010-10-21 14:19 UTC (permalink / raw)
To: Wilson, Kevin Lee (OpenView Engineer); +Cc: git@vger.kernel.org
Hi,
On Thu, Oct 21, 2010 at 8:52 PM, Wilson, Kevin Lee (OpenView Engineer)
<kevin.l.wilson@hp.com> wrote:
> We are investigating the use of GIT as a binary repository solution. Our larger files are near 800MB and the total checked-out repo size is about 3GB. The repo size in SVN is more like 20-30GB; if we could prune the history prior to MR, we could get these sizes down considerably. This binary repo is really for our super project build. From what I have read and learned, this is not a good fit for the GIT tool. Have there been performance improvements lately? Some of the posts I have read are quite old.
check this out:
http://github.com/apenwarr/bup
It's a modified git system that's purpose-built for large files.
That's just about all the useful information I can share with you.
--
Cheers,
Ray Chuan
* Re: GIT as binary repository
2010-10-21 12:52 GIT as binary repository Wilson, Kevin Lee (OpenView Engineer)
2010-10-21 14:19 ` Tay Ray Chuan
@ 2010-10-21 17:38 ` Shawn Pearce
2010-10-21 18:53 ` Wilson, Kevin Lee (OpenView Engineer)
2010-10-21 19:19 ` Enrico Weigelt
2 siblings, 1 reply; 9+ messages in thread
From: Shawn Pearce @ 2010-10-21 17:38 UTC (permalink / raw)
To: Wilson, Kevin Lee (OpenView Engineer); +Cc: git@vger.kernel.org
On Thu, Oct 21, 2010 at 5:52 AM, Wilson, Kevin Lee (OpenView Engineer)
<kevin.l.wilson@hp.com> wrote:
> We are investigating the use of GIT as a binary repository solution. Our larger files are near 800MB and the total checked-out repo size is about 3GB. The repo size in SVN is more like 20-30GB; if we could prune the history prior to MR, we could get these sizes down considerably. This binary repo is really for our super project build. From what I have read and learned, this is not a good fit for the GIT tool. Have there been performance improvements lately? Some of the posts I have read are quite old.
Not really.
Teams who need to store content like this are taking two approaches:
Bite the bullet and use 64 bit systems with a lot of physical memory.
Git allocates/memmaps two memory blocks equal in size to the file you
are trying to work with. If you have an 800MB file, you need ~1.6G of
physical memory just for the Git executable to touch that file. For
most modern desktops and server systems, this is pretty easy to deal
with; 4G or 8G of physical memory in a developer workstation is pretty
inexpensive. If the files aren't delta compressible, you can speed up
the delta compression that occurs during `git gc` by setting the
"-delta" attribute for the relevant paths in your
.git/info/attributes file. Unfortunately this may mean that your
Git repository is large (>20G?), and each developer needs to make a
full copy of it when they start to work on the project. That is a lot
of data to move around, or to store locally. But again when you look
at the cost of disk on a developer workstation, this may not be an
issue if your team can adopt a workflow where they don't clone the
repository often. (E.g. the Android repository is about 7G,
developers clone it once and then don't need to again... so it is
doable.)
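As a rough illustration of the two points above, here is a minimal Python sketch. The 2x factor is just the rule of thumb stated in this message, not a measured figure, and the function names and the example patterns (`*.iso`, `*.tar.gz`) are hypothetical. The only Git-specific detail it relies on is that a line of the form `<pattern> -delta` in .git/info/attributes unsets the delta attribute for matching paths.

```python
from typing import Iterable


def estimated_git_memory(file_size_bytes: int) -> int:
    """Rule of thumb from the message: two blocks, each the size of the file."""
    return 2 * file_size_bytes


def no_delta_attributes(patterns: Iterable[str]) -> str:
    """Build lines for .git/info/attributes that unset the delta attribute.

    The patterns are supplied by the caller; nothing here inspects the repo.
    """
    return "".join(f"{pattern} -delta\n" for pattern in patterns)
```

For an 800MB file, `estimated_git_memory(800 * 1024 * 1024)` gives roughly the ~1.6G mentioned above, and the text returned by `no_delta_attributes(["*.iso", "*.tar.gz"])` could be appended to .git/info/attributes by hand or by a setup script.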
The other option is to use a different repository for the binary
files. Some teams are using a REST enabled HTTP server like Amazon S3
(though you probably want something inside your corporate firewall) to
store the large binary files. Instead of putting the files into Git,
they put a small shell script and a pointer to the file into Git. The
shell script downloads the large binary file when executed, and the
build process (or the developer "start-up" instructions) executes the
script to get the latest versions bootstrapped on the local
workstation.
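The pointer-plus-script idea could look something like the following. This is only a sketch written in Python rather than shell, and everything specific in it is an assumption: the pointer file format (one `key value` pair per line with `url` and `sha256` keys), the function names, and the checksum step, which the message above does not mention but which seems a sensible guard when the binary lives outside Git.

```python
import hashlib
import urllib.request


def parse_pointer(text: str) -> dict:
    """Parse a hypothetical pointer file: one 'key value' pair per line."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition(" ")
        fields[key] = value.strip()
    return fields


def sha256_of(path: str) -> str:
    """Hash a file in 1 MiB chunks so large binaries are not read at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch(pointer_path: str, dest_path: str) -> None:
    """Download the file named by the pointer and verify its checksum."""
    with open(pointer_path, encoding="utf-8") as handle:
        pointer = parse_pointer(handle.read())
    urllib.request.urlretrieve(pointer["url"], dest_path)
    actual = sha256_of(dest_path)
    if actual != pointer["sha256"]:
        raise ValueError(f"checksum mismatch for {dest_path}: {actual}")
```

Only the small pointer file and the script are committed; the build (or the developer start-up instructions) would call `fetch()` once per pointer file.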
> I also have some questions about how the workflow would operate for getting all of the changes from several different teams merged into the one repository. Do we set up a shared system for engineers to perform the merges on? Our teams are geographically dispersed.
Yes, this is the common approach. Actually what I have started to see
with Android is, each distributed office has a shared repository that
the engineers in that office interact with on a daily basis. And
someone in each office synchronizes that repository with a single
central repository that exists somewhere else. Because of the nature
of Git, the central repository can be continuously pulled into the
distributed office through a cron script. Engineers in the office can
therefore always have a "fairly latest" version available, but can
also fork off onto a side-branch and defer merging with the other
offices for a day or two.
Android teams are successfully using this approach by running Gerrit
Code Review[1] as their central server, and using Gerrit's built-in
replication feature to push updates to the distributed office servers.
In effect there is one central server for writes, but a lot of read
operations are offloaded to the distributed offices' local copies.
[1] http://code.google.com/p/gerrit/
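The cron-driven variant (each office pulling from the central repository) might be sketched as below. This is an assumption-laden illustration, not how the Android teams or Gerrit's replication actually do it: the mirror path, the remote name "origin", and the function names are hypothetical. It only uses `git --git-dir=<path> fetch --prune <remote>`, run through Python's `subprocess`.

```python
import subprocess
from typing import List


def sync_commands(mirror_dir: str, remote: str = "origin") -> List[List[str]]:
    """Commands that refresh an office mirror from the central repository."""
    return [
        ["git", f"--git-dir={mirror_dir}", "fetch", "--prune", remote],
    ]


def sync(mirror_dir: str, remote: str = "origin") -> None:
    """Run the sync; raises CalledProcessError if git exits non-zero."""
    for command in sync_commands(mirror_dir, remote):
        subprocess.run(command, check=True)
```

A cron job in each office would call something like `sync("/srv/git/project.git")` every few minutes, so engineers there fetch from the nearby mirror rather than from the central server.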
--
Shawn.
* RE: GIT as binary repository
2010-10-21 17:38 ` Shawn Pearce
@ 2010-10-21 18:53 ` Wilson, Kevin Lee (OpenView Engineer)
0 siblings, 0 replies; 9+ messages in thread
From: Wilson, Kevin Lee (OpenView Engineer) @ 2010-10-21 18:53 UTC (permalink / raw)
To: Shawn Pearce; +Cc: git@vger.kernel.org
Thanks for the detailed input.
Kevin
* Re: GIT as binary repository
2010-10-21 12:52 GIT as binary repository Wilson, Kevin Lee (OpenView Engineer)
2010-10-21 14:19 ` Tay Ray Chuan
2010-10-21 17:38 ` Shawn Pearce
@ 2010-10-21 19:19 ` Enrico Weigelt
2 siblings, 0 replies; 9+ messages in thread
From: Enrico Weigelt @ 2010-10-21 19:19 UTC (permalink / raw)
To: git@vger.kernel.org
* Wilson, Kevin Lee (OpenView Engineer) <kevin.l.wilson@hp.com> wrote:
Hi,
> We are investigating the use of GIT as a binary repository solution.
> Our larger files are near 800MB and the total checked-out repo size
> is about 3GB. The repo size in SVN is more like 20-30GB; if we could
> prune the history prior to MR, we could get these sizes down
> considerably. This binary repo is really for our super project build.
Why exactly do you need such large binary objects in a git repo?
IMHO, Git isn't made for large files. I've noticed this when doing
git-based mail archives on an old P3 box w/ 256MB physical memory.
I had to split mbox'es to maildirs.
Perhaps you would like to have a look at some pure object store like
venti? (It's not distributed yet, but I'm currently working on a
distributed successor, called Nebulon, which will also support
strong encryption, on-demand replication, etc).
> I also have some questions about how the workflow would operate
> for getting all of the changes from several different teams merged
> into the one repository.
IMHO, there should be some dedicated release manager role, which
is responsible for merging finished branches into the mainline
(e.g. similar to what Linus does for the official Linux tree).
BUT: you perhaps should think carefully about whether you need everything
in one big repo. Perhaps a bunch of smaller ones (e.g. having separate
modules in their own repos/trees) would fit better.
cu
--
----------------------------------------------------------------------
Enrico Weigelt, metux IT service -- http://www.metux.de/
phone: +49 36207 519931 email: weigelt@metux.de
mobile: +49 151 27565287 icq: 210169427 skype: nekrad666
----------------------------------------------------------------------
Embedded-Linux / Portierung / Opensource-QM / Verteilte Systeme
----------------------------------------------------------------------
* Re: GIT as binary repository
2010-10-21 14:19 ` Tay Ray Chuan
@ 2010-10-21 19:54 ` Enrico Weigelt
2010-10-21 21:23 ` Shawn Pearce
0 siblings, 1 reply; 9+ messages in thread
From: Enrico Weigelt @ 2010-10-21 19:54 UTC (permalink / raw)
To: git@vger.kernel.org
* Tay Ray Chuan <rctay89@gmail.com> wrote:
> Hi,
>
> On Thu, Oct 21, 2010 at 8:52 PM, Wilson, Kevin Lee (OpenView Engineer)
> <kevin.l.wilson@hp.com> wrote:
> > We are investigating the use of GIT as a binary repository solution. Our larger files are near 800MB and the total checked-out repo size is about 3GB. The repo size in SVN is more like 20-30GB; if we could prune the history prior to MR, we could get these sizes down considerably. This binary repo is really for our super project build. From what I have read and learned, this is not a good fit for the GIT tool. Have there been performance improvements lately? Some of the posts I have read are quite old.
>
> check this out:
>
> http://github.com/apenwarr/bup
>
> It's a modified git system that's purpose-built for large files.
> That's just about all the useful information I can share with you.
Looks quite promising, perhaps it can help solve my current
backup problems.
Maybe we could implement some of the features, e.g. the hashsplit
format (maybe it could even be combined w/ xdelta?) or the bupindex
and some extended metadata, directly in git?
BTW: how are the current tree objects structured? Is there always
one big tree object representing the whole tree or can it be
split hierarchically?
cu
* Re: GIT as binary repository
2010-10-21 19:54 ` Enrico Weigelt
@ 2010-10-21 21:23 ` Shawn Pearce
2010-10-22 5:02 ` Enrico Weigelt
0 siblings, 1 reply; 9+ messages in thread
From: Shawn Pearce @ 2010-10-21 21:23 UTC (permalink / raw)
To: weigelt, git@vger.kernel.org
On Thu, Oct 21, 2010 at 12:54 PM, Enrico Weigelt <weigelt@metux.de> wrote:
> BTW: how are the current tree objects structured? Is there always
> one big tree object representing the whole tree or can it be
> split hierarchically?
Trees in the repository are split hierarchically, but it's a flat list
in the dircache/index file (aka $GIT_DIR/index).
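To make the distinction concrete, here is a toy Python model (a sketch only; it does not read or write Git's real object or index formats, and the names and blob ids are made up). Nested dicts stand in for tree objects, one per directory, and the flattened result stands in for the index, which lists every file by its full path.

```python
def _walk(tree: dict, prefix: str, entries: list) -> None:
    """Recursively collect (full_path, blob_id) pairs from nested trees."""
    for name, node in tree.items():
        path = prefix + name
        if isinstance(node, dict):
            _walk(node, path + "/", entries)  # subdirectory: its own tree
        else:
            entries.append((path, node))  # file: a blob id


def flatten_to_index(tree: dict) -> list:
    """Flatten hierarchical trees into one path-sorted list, index-style."""
    entries = []
    _walk(tree, "", entries)
    # Sorting str paths approximates the index's byte-wise path order
    # for ASCII names.
    return sorted(entries)
```

So a change to one file deep in the hierarchy only creates new tree objects along its path, while the index is always a single list covering every file in the working tree.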
--
Shawn.
* Re: GIT as binary repository
2010-10-21 21:23 ` Shawn Pearce
@ 2010-10-22 5:02 ` Enrico Weigelt
2010-10-22 19:20 ` Shawn Pearce
0 siblings, 1 reply; 9+ messages in thread
From: Enrico Weigelt @ 2010-10-22 5:02 UTC (permalink / raw)
To: git@vger.kernel.org
* Shawn Pearce <spearce@spearce.org> wrote:
> On Thu, Oct 21, 2010 at 12:54 PM, Enrico Weigelt <weigelt@metux.de> wrote:
> > BTW: how are the current tree objects structured? Is there always
> > one big tree object representing the whole tree or can it be
> > split hierarchically?
>
> Trees in the repository are split hierarchically, but it's a flat list
> in the dircache/index file (aka $GIT_DIR/index).
Good. That should allow optimizations for large trees w/o
having to change object formats.
What about split blobs? Is this supported by the object
format right now?
cu
* Re: GIT as binary repository
2010-10-22 5:02 ` Enrico Weigelt
@ 2010-10-22 19:20 ` Shawn Pearce
0 siblings, 0 replies; 9+ messages in thread
From: Shawn Pearce @ 2010-10-22 19:20 UTC (permalink / raw)
To: weigelt, git@vger.kernel.org
On Thu, Oct 21, 2010 at 10:02 PM, Enrico Weigelt <weigelt@metux.de> wrote:
> * Shawn Pearce <spearce@spearce.org> wrote:
> What about split blobs? Is this supported by the object
> format right now?
Nope.
--
Shawn.