* GIT as binary repository
From: Wilson, Kevin Lee (OpenView Engineer) @ 2010-10-21 12:52 UTC
To: git@vger.kernel.org

Hello,

We are investigating the use of GIT as a binary repository solution. Our larger files are near 800MB and the total checked-out repo size is about 3GB. The repo size in SVN is more like 20-30GB; if we could prune the history prior to MR, we could get these sizes down considerably. This binary repo is really for our super project build. From what I have read and learned, this is not a good fit for the GIT tool. Have there been performance improvements lately? Some of the posts I have read are quite old.

I also have some questions about how the workflow would operate for getting all of the changes from several different teams merged into the one repository. Do we set up a shared system for engineers to perform the merges on? Our teams are geographically dispersed.

Thanks for any light you can shed on this. We are trying to digest a lot of information quickly, so sorry if some of this is covered elsewhere.

Thanks,
Kevin
* Re: GIT as binary repository
From: Tay Ray Chuan @ 2010-10-21 14:19 UTC
To: Wilson, Kevin Lee (OpenView Engineer); +Cc: git@vger.kernel.org

Hi,

On Thu, Oct 21, 2010 at 8:52 PM, Wilson, Kevin Lee (OpenView Engineer) <kevin.l.wilson@hp.com> wrote:
> We are investigating the use of GIT as a binary repository solution. Our larger files are near 800MB and the total checked-out repo size is about 3GB. The repo size in SVN is more like 20-30GB; if we could prune the history prior to MR, we could get these sizes down considerably. This binary repo is really for our super project build. From what I have read and learned, this is not a good fit for the GIT tool. Have there been performance improvements lately? Some of the posts I have read are quite old.

Check this out:

http://github.com/apenwarr/bup

It's a modified git system that's purpose-built for large files. That's just about all the sensible information I can share with you.

--
Cheers,
Ray Chuan
* Re: GIT as binary repository
From: Enrico Weigelt @ 2010-10-21 19:54 UTC
To: git@vger.kernel.org

* Tay Ray Chuan <rctay89@gmail.com> wrote:
> Hi,
>
> On Thu, Oct 21, 2010 at 8:52 PM, Wilson, Kevin Lee (OpenView Engineer) <kevin.l.wilson@hp.com> wrote:
> > We are investigating the use of GIT as a binary repository solution. Our larger files are near 800MB and the total checked-out repo size is about 3GB. The repo size in SVN is more like 20-30GB; if we could prune the history prior to MR, we could get these sizes down considerably. This binary repo is really for our super project build. From what I have read and learned, this is not a good fit for the GIT tool. Have there been performance improvements lately? Some of the posts I have read are quite old.
>
> Check this out:
>
> http://github.com/apenwarr/bup
>
> It's a modified git system that's purpose-built for large files.
> That's just about all the sensible information I can share with you.

Looks quite promising; perhaps it can help solve my current backup problems.

Maybe we could implement some of its features, e.g. the hashsplit format (maybe it could even be combined w/ xdelta?) or the bupindex and some extended metadata, directly in git?

BTW: how are the current tree objects structured? Is there always one big tree object representing the whole tree, or can it be split hierarchically?
cu
--
Enrico Weigelt, metux IT service -- http://www.metux.de/
phone: +49 36207 519931  email: weigelt@metux.de
mobile: +49 151 27565287  icq: 210169427  skype: nekrad666
Embedded Linux / Porting / Open-source QM / Distributed Systems
* Re: GIT as binary repository
From: Shawn Pearce @ 2010-10-21 21:23 UTC
To: weigelt, git@vger.kernel.org

On Thu, Oct 21, 2010 at 12:54 PM, Enrico Weigelt <weigelt@metux.de> wrote:
> BTW: how are the current tree objects structured? Is there always
> one big tree object representing the whole tree, or can it be
> split hierarchically?

Trees in the repository are split hierarchically, but it's a flat list in the dircache/index file (aka $GIT_DIR/index).

--
Shawn.
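The distinction in the reply above can be illustrated with a minimal Python sketch. It models tree objects as nested dictionaries (one dictionary per directory) and flattens them into a sorted list of full paths, which is roughly the shape of the entry list in $GIT_DIR/index. The dictionary layout, the name `flatten_tree`, and the placeholder blob IDs are assumptions made for illustration; none of this is Git's actual on-disk format.

```python
from typing import Dict, List, Tuple, Union

# Assumed model: a tree maps a name to a blob ID (str) or a nested tree (dict).
Tree = Dict[str, Union[str, "Tree"]]


def flatten_tree(tree: Tree, prefix: str = "") -> List[Tuple[str, str]]:
    """Return (full_path, blob_id) pairs for every blob below `tree`."""
    entries: List[Tuple[str, str]] = []
    for name, value in tree.items():
        path = prefix + name
        if isinstance(value, dict):
            # A sub-directory is its own tree object in the repository.
            entries.extend(flatten_tree(value, path + "/"))
        else:
            entries.append((path, value))
    # The index stores one entry per file, ordered by full path.
    return sorted(entries)


if __name__ == "__main__":
    repo_tree: Tree = {
        "Makefile": "blob-1",
        "src": {"main.c": "blob-2", "lib": {"util.c": "blob-3"}},
    }
    for path, blob_id in flatten_tree(repo_tree):
        print(path, blob_id)
```

In this model, changing `src/lib/util.c` touches only the trees along that path (`src/lib`, `src`, and the root), while the flat list always carries one entry per file.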
* Re: GIT as binary repository
From: Enrico Weigelt @ 2010-10-22 5:02 UTC
To: git@vger.kernel.org

* Shawn Pearce <spearce@spearce.org> wrote:
> On Thu, Oct 21, 2010 at 12:54 PM, Enrico Weigelt <weigelt@metux.de> wrote:
> > BTW: how are the current tree objects structured? Is there always
> > one big tree object representing the whole tree, or can it be
> > split hierarchically?
>
> Trees in the repository are split hierarchically, but it's a flat list
> in the dircache/index file (aka $GIT_DIR/index).

Good. That should allow optimizations for large trees w/o having to change object formats.

What about split blobs? Are they supported by the object format right now?

cu
--
Enrico Weigelt, metux IT service -- http://www.metux.de/
* Re: GIT as binary repository
From: Shawn Pearce @ 2010-10-22 19:20 UTC
To: weigelt, git@vger.kernel.org

On Thu, Oct 21, 2010 at 10:02 PM, Enrico Weigelt <weigelt@metux.de> wrote:
> * Shawn Pearce <spearce@spearce.org> wrote:
> What about split blobs? Are they supported by the object
> format right now?

Nope.

--
Shawn.
* Re: GIT as binary repository
From: Shawn Pearce @ 2010-10-21 17:38 UTC
To: Wilson, Kevin Lee (OpenView Engineer); +Cc: git@vger.kernel.org

On Thu, Oct 21, 2010 at 5:52 AM, Wilson, Kevin Lee (OpenView Engineer) <kevin.l.wilson@hp.com> wrote:
> We are investigating the use of GIT as a binary repository solution. Our larger files are near 800MB and the total checked-out repo size is about 3GB. The repo size in SVN is more like 20-30GB; if we could prune the history prior to MR, we could get these sizes down considerably. This binary repo is really for our super project build. From what I have read and learned, this is not a good fit for the GIT tool. Have there been performance improvements lately? Some of the posts I have read are quite old.

Not really. Teams who need to store content like this are taking one of two approaches.

The first is to bite the bullet and use 64-bit systems with a lot of physical memory. Git allocates or memory-maps two blocks, each equal in size to the file you are trying to work with. If you have an 800MB file, you need ~1.6G of physical memory just for the Git executable to touch that file. For most modern desktops and server systems this is pretty easy to deal with; 4G or 8G of physical memory in a developer workstation is pretty inexpensive.

If the files aren't delta compressible, you can speed up the delta compression that occurs during `git gc` by adding a gitattribute with the "-delta" flag for the relevant paths to your .git/info/attributes file.

Unfortunately this may mean that your Git repository is large (>20G?), and each developer needs to make a full copy of it when they start to work on the project.
That is a lot of data to move around, or to store locally. But again, when you look at the cost of disk on a developer workstation, this may not be an issue if your team can adopt a workflow where they don't clone the repository often. (E.g. the Android repository is about 7G; developers clone it once and then don't need to again... so it is doable.)

The other option is to use a different repository for the binary files. Some teams are using a REST-enabled HTTP server like Amazon S3 (though you probably want something inside your corporate firewall) to store the large binary files. Instead of putting the files into Git, they put a small shell script and a pointer to the file into Git. The shell script downloads the large binary file when executed, and the build process (or the developer "start-up" instructions) executes the script to get the latest versions bootstrapped on the local workstation.

> I also have some questions about how the workflow would operate for getting all of the changes from several different teams merged into the one repository. Do we set up a shared system for engineers to perform the merges on? Our teams are geographically dispersed.

Yes, this is the common approach.

What I have started to see with Android is that each distributed office has a shared repository that the engineers in that office interact with on a daily basis, and someone in each office synchronizes that repository with a single central repository that exists somewhere else. Because of the nature of Git, the central repository can be continuously pulled into the distributed office through a cron script. Engineers in the office can therefore always have a "fairly latest" version available, but can also fork off onto a side-branch and defer merging with the other offices for a day or two.
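The periodic pull described above could be driven by a small script that cron runs every few minutes. The following Python sketch is one way to do that, assuming the office repository is a bare clone whose `origin` remote points at the central repository; the function names and the example path are hypothetical.

```python
import subprocess
from typing import List


def mirror_fetch_command(git_dir: str, remote: str = "origin") -> List[str]:
    """Build the git command that refreshes the office repository from `remote`."""
    # --prune also drops remote-tracking refs that were deleted centrally.
    return ["git", f"--git-dir={git_dir}", "fetch", "--prune", remote]


def sync_mirror(git_dir: str, remote: str = "origin") -> None:
    """Run one synchronization pass; raises CalledProcessError if git fails."""
    subprocess.run(mirror_fetch_command(git_dir, remote), check=True)
```

A cron job would call something like `sync_mirror("/srv/git/superproject.git")`, where the path is only a placeholder. Because a fetch only adds objects and updates remote-tracking refs, running it frequently does not disturb engineers' side-branches.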
Android teams are successfully using this approach by running Gerrit Code Review[1] as their central server, and using Gerrit's built-in replication feature to push updates to the distributed office servers. In effect there is one central server for writes, but a lot of read operations are offloaded onto the distributed offices' local copies.

[1] http://code.google.com/p/gerrit/

--
Shawn.
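The pointer-plus-script approach mentioned in this message could look roughly like the sketch below. The message suggests a shell script; Python is used here purely for illustration. The pointer format (a SHA-256 digest followed by a URL on one line) and the function names are assumptions, not an established convention.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large binary is never fully held in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_from_pointer(pointer: Path, dest: Path) -> Path:
    """Download the binary named by `pointer` unless a valid copy already exists."""
    # Assumed pointer format: "<sha256-hex> <url>" on a single line.
    expected, url = pointer.read_text().split()
    if dest.exists() and sha256_of(dest) == expected:
        return dest  # Already bootstrapped; skip the download.
    with urllib.request.urlopen(url) as response, dest.open("wb") as out:
        shutil.copyfileobj(response, out)
    if sha256_of(dest) != expected:
        raise ValueError(f"checksum mismatch for {url}")
    return dest
```

Only the small pointer file is committed to Git. Recording a checksum in it ties each commit to one exact version of the binary, and lets the build skip the download when the local copy is already current.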
* RE: GIT as binary repository
From: Wilson, Kevin Lee (OpenView Engineer) @ 2010-10-21 18:53 UTC
To: Shawn Pearce; +Cc: git@vger.kernel.org

Thanks for the detailed input.

Kevin
* Re: GIT as binary repository
From: Enrico Weigelt @ 2010-10-21 19:19 UTC
To: git@vger.kernel.org

* Wilson, Kevin Lee (OpenView Engineer) <kevin.l.wilson@hp.com> wrote:

Hi,

> We are investigating the use of GIT as a binary repository solution.
> Our larger files are near 800MB and the total checked-out repo size
> is about 3GB. The repo size in SVN is more like 20-30GB; if we could
> prune the history prior to MR, we could get these sizes down
> considerably. This binary repo is really for our super project build.

Why exactly do you need such large binary objects in a git repo?

IMHO, Git isn't made for large files. I noticed this when doing git-based mail archives on an old P3 box w/ 256MB physical memory. I had to split mboxes into maildirs.

Perhaps you would like to have a look at a pure object store like venti? (It's not distributed yet, but I'm currently working on a distributed successor, called Nebulon, which will also support strong encryption, on-demand replication, etc.)

> I also have some questions about how the workflow would operate for
> getting all of the changes from several different teams merged
> into the one repository.

IMHO, there should be a dedicated release manager role, responsible for merging finished branches into the mainline (e.g. similar to what Linus does for the official Linux tree).

BUT: you should perhaps think carefully about whether you need everything in one big repo. Perhaps a bunch of smaller ones (e.g. having separate modules in their own repos/trees) would fit better.
cu
--
Enrico Weigelt, metux IT service -- http://www.metux.de/