* blobs (once more)

From: Pau Garcia i Quiles
Date: 2011-04-06 8:09 UTC
To: Git Mailing List
Cc: Johannes Schindelin

Hello,

Binary large objects. I know this has been discussed again and again, but I'd like to know if there is anything new.

A corporation hired the company I work for one year ago to develop a large application. They imposed ClearCase as the VCS. I don't know if you have used it, but it is a pain in the ass. We have lost weeks of development to site-replication problems, funny merges, etc. We are trying to migrate our project to git, which we have experience with.

One very important point in this project (which is Windows-only) is putting binaries in the repository. So far we have succeeded in not doing that in other projects, but we will need to do it in this one.

In the Windows world, it is not unusual to use third-party libraries which are only available in binary form. Getting them as source is not an option because the companies developing them do not sell the source. Moving from those binary-only dependencies to something else is not an option either, because what we are using has some unique features, be they technical features or support features. In our project we have about a dozen such binaries, ranging from a few hundred kilobytes to a couple hundred megabytes (proprietary database and virtualization engine).

The usual answer to the "I need to put binaries in the repository" question has been "no, you do not". Well, we do. We are in heavy development now, so today's version may depend on a certain version of a third-party shared library (DLL) which we can only get in binary form, and tomorrow's version may depend on the next version of that library; you cannot mix today's source with yesterday's third-party DLL. I.e., to be able to use the code from 7 days ago at 11:07 AM, you need "git checkout" to "return" our source AND the binaries we were using back then. This is something ClearCase manages satisfactorily.

I have read about:
- submodules, switching to a different repository once one "blob repository" grows too much. This will probably be rejected because it is quite contrived.
- git-annex (does not fetch the files when cloning, pulling, or checking out; you need to do it manually)
- git-media (same as git-annex)
- boar (no, we do not want to use a second VCS for binaries in addition to git)
- and a few more

So far the only good solution seems to be git-bigfiles, but it is still in development.

Is there any good solution for my use case, where version = sources version + binaries version?

Thank you. If we succeed with git here, the whole corporation (150,000+ employees, Fortune 500) may start to move to git in a year. Many people are fed up with CC there.

--
Pau Garcia i Quiles
http://www.elpauer.org
(Due to my workload, I may need 10 days to answer)
* Re: blobs (once more)

From: Johannes Schindelin
Date: 2011-04-06 9:25 UTC
To: Pau Garcia i Quiles
Cc: Git Mailing List

Hi,

On Wed, 6 Apr 2011, Pau Garcia i Quiles wrote:

> Binary large objects. I know this has been discussed again and again, but I'd like to know if there is anything new.

[...]

> The usual answer to the "I need to put binaries in the repository" question has been "no, you do not". Well, we do. We are in heavy development now, so today's version may depend on a certain version of a third-party shared library (DLL) which we can only get in binary form, and tomorrow's version may depend on the next version of that library; you cannot mix today's source with yesterday's third-party DLL. I.e., to be able to use the code from 7 days ago at 11:07 AM, you need "git checkout" to "return" our source AND the binaries we were using back then. This is something ClearCase manages satisfactorily.

I understand. The problem in your case might not be too bad after all. It only arises when you have big files that are compressed: if you check in multiple versions of an uncompressed .dll file, Git will usually do a very good job of delta-compressing them.

If they are compressed, what you probably need is something like a sparse clone, which is sort of available in the form of shallow clones, but that is still too limited.

Having said that, another company I work for has 20G repositories, and they will grow larger. They incurred that for historical reasons, and they are willing to pay the price in disk space. Due to Git's distributed nature, they had no problems with cloning; they just use a local reference upon the initial clone.

> I have read about:
> - submodules, switching to a different repository once one "blob repository" grows too much. This will probably be rejected because it is quite contrived.

I would also recommend against this, because submodules are a very weak part of Git.

> - git-annex (does not fetch the files when cloning, pulling, or checking out; you need to do it manually)
> - git-media (same as git-annex)

Yes, this is an option, but a bit clunky.

> - boar (no, we do not want to use a second VCS for binaries in addition to git)

I did not know about that one.

> - and a few more
>
> So far the only good solution seems to be git-bigfiles, but it is still in development.

It has stalled, apparently, but I wanted to have a look at it anyway. Will let you know of my findings!

Ciao,
Johannes
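The "local reference upon the initial clone" trick Johannes mentions is plain git: an existing clone on a fast local disk or LAN share is used as an object borrow source, so a new clone neither re-transfers nor re-stores objects that are already available locally. A sketch, simulated here with throwaway local repositories (in real use the upstream would be the central repository URL):

```shell
# Throwaway sandbox; in real use "$upstream" is the central repository
# URL and "$mirror" a full clone kept on a local disk or LAN share.
tmp=$(mktemp -d)
upstream=$tmp/upstream.git
mirror=$tmp/mirror

# A (simulated) upstream with some history:
git init -q "$upstream"
git -C "$upstream" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m 'initial commit'

# One full clone pays the transfer cost once:
git clone -q "$upstream" "$mirror"

# Every later clone borrows objects from the local mirror, so big blobs
# are not fetched or stored a second time:
git clone -q --reference "$mirror" "$upstream" "$tmp/work"

# The borrowed object store is recorded here:
cat "$tmp/work/.git/objects/info/alternates"
```

Note that the borrowing clone depends on the reference repository staying intact; deleting or pruning the mirror can corrupt clones that point at it, which is the price paid for the saved space.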
* Re: blobs (once more)

From: Michael J Gruber
Date: 2011-04-06 12:20 UTC
To: Johannes Schindelin
Cc: Pau Garcia i Quiles, Git Mailing List

Johannes Schindelin venit, vidit, dixit 06.04.2011 11:25:

[...]

>> So far the only good solution seems to be git-bigfiles, but it is still in development.
>
> It has stalled, apparently, but I wanted to have a look at it anyway. Will let you know of my findings!

I think in many applications the "download-on-demand" approach which git-annex takes is very important. (I don't know how far our sparse/shallow support covers this.) Also, their remote backends look interesting. And no, I don't want Haskell as yet another language for our code base.

Fedora handles big files (compressed tarballs) in git with a file store, some scripting (fedpkg), and a text file of hash values ("sources") that is the only thing tracked in git -- somehow a baby version of git-annex.

The symlink-based approach of annex (the big file is a symlink into the "object store", which is indexed by blob content SHA-1) reminds me very much of our notes trees and the way the textconv cache uses them. It feels as if we already have all the pieces in place. (I don't think we need to track the big files' contents, only their hashes; this is fast for read-only media, see annex's WORM backend.)

Another crazy idea would be to "git replace" big files by placeholders (a blob with the big file's SHA-1 as content), or rather the other way round, but I haven't thought this through.

Michael
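The Fedora scheme Michael describes can be sketched in a few lines of shell. The file names and the "sources" format below are simplified guesses for illustration, not fedpkg's actual layout: git tracks only a small text file of checksums, and a helper copies in any blob whose checksum is listed but whose file is missing, verifying it on arrival.

```shell
# Sketch of a fedpkg/annex-style "sources" manifest: each line is
# "<sha1>  <filename>" (i.e. sha1sum(1) output), and only this text
# file -- never the big binaries themselves -- is tracked in git.
#
# fetch_sources <sources-file> <store-dir> copies each missing file
# from a content-addressed store and verifies its checksum.
fetch_sources() {
    sources=$1
    store=$2
    while read -r sha name; do
        [ -n "$sha" ] || continue
        if [ ! -e "$name" ]; then
            # The store indexes blobs by their content hash; in a
            # networked setup an scp or HTTP GET would go here.
            cp "$store/$sha" "$name"
        fi
        # Integrity check: refuse silently-modified blobs.
        printf '%s  %s\n' "$sha" "$name" | sha1sum -c --quiet
    done < "$sources"
}
```

Because the store is keyed by content hash, old revisions keep fetching the exact bytes they were built against, which is precisely the "git checkout returns source AND binaries" property Pau asks for.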
* Re: blobs (once more)

From: Martin Langhoff
Date: 2011-04-06 14:14 UTC
To: Johannes Schindelin
Cc: Pau Garcia i Quiles, Git Mailing List

On Wed, Apr 6, 2011 at 5:25 AM, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> I understand. The problem in your case might not be too bad after all. It only arises when you have big files that are compressed: if you check in multiple versions of an uncompressed .dll file, Git will usually do a very good job of delta-compressing them.

Except when they are very large; in that case git tends to OOM. But just yesterday Junio posted a proposed patch to honour a maximum file size for compression (search the archive for 'Git exhausts memory' and 'core.bigFileThreshold'). So Pau might be in luck with current git + Junio's patch + enough RAM on the workstations. Pau, I definitely suggest you try it out.

If it still consumes too much memory with the largest files (i.e. the VM images you mention), the fedpkg approach (discussed in this thread) is good too. It's a Python wrapper around git which tracks a text file listing the hashes of the large files and fetches them (or uploads them) via SCP. You only need to use it when dealing with the large files -- most of the time you're just using git. The fedpkg code is quite readable, and I've already "stolen" some of it for my local needs. Recommended.

cheers,


m
--
martin.langhoff@gmail.com
martin@laptop.org -- Software Architect - OLPC
- ask interesting questions
- don't get distracted with shiny stuff - working code first
- http://wiki.laptop.org/go/User:Martinlanghoff
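Assuming Junio's patch lands with the semantics discussed in the 'Git exhausts memory' thread, the knob Martin mentions would be set per repository like this (the 512m value and the `*.vdi` pattern are just illustrations):

```shell
# Demonstrated in a scratch repository; in Pau's case this would be run
# once in the real work tree.
cd "$(mktemp -d)" && git init -q

# Files larger than this threshold are streamed verbatim instead of
# being delta-compressed in memory (value is illustrative):
git config core.bigFileThreshold 512m

# Delta compression can also be disabled per path via gitattributes,
# which helps with already-compressed binaries like VM images:
echo '*.vdi -delta' >> .gitattributes
```

This only caps the memory git spends on a single blob; it does not shrink the repository, so the clone-size concerns elsewhere in this thread still apply.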
* Re: blobs (once more)

From: Matthieu Moy
Date: 2011-04-06 11:06 UTC
To: Pau Garcia i Quiles
Cc: Git Mailing List, Johannes Schindelin

Pau Garcia i Quiles <pgquiles@elpauer.org> writes:

> I have read about:
> - submodules, switching to a different repository once one "blob repository" grows too much. This will probably be rejected because it is quite contrived.
> - git-annex (does not fetch the files when cloning, pulling, or checking out; you need to do it manually)
> - git-media (same as git-annex)
> - boar (no, we do not want to use a second VCS for binaries in addition to git)
> - and a few more
>
> So far the only good solution seems to be git-bigfiles, but it is still in development.

This seems to be the hot topic of the moment in Git ;-). Look at the mailing-list archive, e.g.

http://thread.gmane.org/gmane.comp.version-control.git/170649
https://git.wiki.kernel.org/index.php/SoC2011Ideas#Better_big-file_support

and there may be a GSoC project on the topic. I don't think there is a really good solution for you right now (though I didn't look closely), but the situation is improving.

--
Matthieu Moy
http://www-verimag.imag.fr/~moy/
* RE: blobs (once more)

From: Peter Jönsson P
Date: 2011-04-06 11:12 UTC
To: Matthieu Moy, Pau Garcia i Quiles
Cc: Git Mailing List, Johannes Schindelin, Andrey Devyatkin

Hi!

How about using Google's repo tool instead of submodules? Is that a better solution if one really needs to keep binaries together (in some way) with the source code?

We are starting to prototype Git, and since we need to distribute cross-compilers with the code it would be nice to keep them in a separate repo. Currently we are _heavy_ ClearCase users (oh, the horror) with many bad old habits that we are trying to break :)

// Peter

-----Original Message-----
From: Matthieu Moy
Sent: den 6 april 2011 13:06
To: Pau Garcia i Quiles
Cc: Git Mailing List; Johannes Schindelin
Subject: Re: blobs (once more)

[...]

This seems to be the hot topic of the moment in Git ;-). Look at the mailing-list archive, e.g.

http://thread.gmane.org/gmane.comp.version-control.git/170649
https://git.wiki.kernel.org/index.php/SoC2011Ideas#Better_big-file_support

and there may be a GSoC project on the topic. I don't think there is a really good solution for you right now (though I didn't look closely), but the situation is improving.

--
Matthieu Moy
http://www-verimag.imag.fr/~moy/
* Re: blobs (once more)

From: Magnus Bäck
Date: 2011-04-06 16:42 UTC
To: Peter Jönsson P
Cc: Matthieu Moy, Pau Garcia i Quiles, Git Mailing List, Johannes Schindelin, Andrey Devyatkin

On Wednesday, April 06, 2011 at 13:12 CEST, Peter Jönsson P <peter.p.jonsson@ericsson.com> wrote:

> How about using Google's repo tool instead of submodules? Is that a better solution if one really needs to keep binaries together (in some way) with the source code?

I don't know if Repo vs. submodules is the most interesting choice; both rely on plain git repositories for storage. We stored large binaries in git repositories for a while too, but for us the major hassle was that it was killing our servers. When too many of our few hundred developers were fetching multi-GB repos, the 8-16 core, 24 GB RAM server (I don't know the exact specs at the time) was really down on its knees. Now, we were using JGit (via Gerrit), and I'm sure tuning could have helped -- as could simply replicating the data and spreading the load across multiple servers -- but don't forget this factor when choosing the tool.

> We are starting to prototype Git, and since we need to distribute cross-compilers with the code it would be nice to keep them in a separate repo. Currently we are _heavy_ ClearCase users (oh, the horror) with many bad old habits that we are trying to break :)

We've been there too.

--
Magnus Bäck                    Opinions are my own and do not necessarily
SW Configuration Manager       represent the ones of my employer, etc.
Sony Ericsson
* Re: blobs (once more)

From: Miles Bader
Date: 2011-04-07 5:20 UTC
To: Pau Garcia i Quiles
Cc: Git Mailing List, Johannes Schindelin

Pau Garcia i Quiles <pgquiles@elpauer.org> writes:

> The usual answer to the "I need to put binaries in the repository" question has been "no, you do not". Well, we do. We are in heavy development now, so today's version may depend on a certain version of a third-party shared library (DLL) which we can only get in binary form, and tomorrow's version may depend on the next version of that library; you cannot mix today's source with yesterday's third-party DLL. I.e., to be able to use the code from 7 days ago at 11:07 AM, you need "git checkout" to "return" our source AND the binaries we were using back then. This is something ClearCase manages satisfactorily.

If it were me, I'd just store the huge binaries in some sort of separate remote filesystem, and then store the remote-filesystem _paths_ to them in git (in a simple text file). Then either use the build system or some sort of git filter to make sure the actual library is installed before building, based on the path read from the file in git.

[This would be a pain as a _general_ solution (for git), because it involves coordination with the remote file system, etc., but for an organization like yours setting up a system for a specific product, it should be fairly easy to set up and maintain -- particularly if the main use is to store third-party library releases, since those are typically not something anybody will want to check in, but rather installed by a small set of people.]

-Miles

--
Circus, n. A place where horses, ponies and elephants are permitted to see men, women and children acting the fool.
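A minimal sketch of the scheme Miles describes, with a made-up manifest format and made-up paths (neither is from the thread): a text file tracked in git maps each dependency to its location on the shared file system, and a pre-build step copies in whatever is missing.

```shell
# Hypothetical pre-build fetch step for binary-only dependencies.
# fetch_deps <manifest> <libdir>: the manifest is a text file tracked
# in git with one "local-name<TAB>remote-path" pair per line; libdir
# is where the binaries end up in the build tree.
fetch_deps() {
    manifest=$1
    libdir=$2
    mkdir -p "$libdir"
    tab=$(printf '\t')
    while IFS=$tab read -r name remote; do
        # Skip blank lines and comments.
        case $name in ''|'#'*) continue ;; esac
        # Only copy what is missing; in Miles's setup this cp would be
        # a copy from the shared remote file system (or an scp).
        [ -e "$libdir/$name" ] || cp "$remote" "$libdir/$name"
    done < "$manifest"
}
```

Since the manifest is versioned, "git checkout" of an old commit brings back the old paths and thus the old binaries -- though, as noted in the follow-up, nothing here verifies that the file behind a path has not been replaced in the meantime.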
* Re: blobs (once more)

From: Johannes Schindelin
Date: 2011-04-07 6:45 UTC
To: Miles Bader
Cc: Pau Garcia i Quiles, Git Mailing List

Hi,

On Thu, 7 Apr 2011, Miles Bader wrote:

> If it were me, I'd just store the huge binaries in some sort of separate remote filesystem, and then store the remote-filesystem _paths_ to them in git (in a simple text file).

That fails for a number of reasons:

- it does not pass the 30,000-feet-high test;
- integrity is not guaranteed (anybody can edit the files on the remote file system, and nobody would realize that a "git checkout HEAD~2000" ends up producing something different from before);
- you would have to reinvent an efficient transfer (e.g. one taking into account all the data you already have);
- storage is no longer efficient, especially if you have multiple versions of the same file;
- it is no longer decentralized. Just think of yourself sitting in the middle of Antarctica, desperately needing to match a penguin against a database of known penguins. You definitely want to have the database locally instead of leeching it down the non-existing wire all the time. Likewise, if you and your group sit, say, on Viti Levu and develop software with people in New York and Texas, you definitely want a repository-in-the-middle, making it one person's duty to synchronize, say, once per day.

I am sure you can think of more reasons.

Ciao,
Johannes