git.vger.kernel.org archive mirror
* blobs (once more)
@ 2011-04-06  8:09 Pau Garcia i Quiles
  2011-04-06  9:25 ` Johannes Schindelin
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Pau Garcia i Quiles @ 2011-04-06  8:09 UTC (permalink / raw)
  To: Git Mailing List; +Cc: Johannes Schindelin

Hello,

Binary large objects. I know it has been discussed once and again but
I'd like to know if there is something new.

Some corporation hired the company I work for one year ago to develop
a large application. They imposed ClearCase as the VCS. I don't know
if you have used it but it is a pain in the ass. We have lost weeks of
development to site-replication problems, funny merges, etc. We are
trying to migrate our project to git, which we have experience with.

One very important point in this project (which is Windows only) is
putting binaries in the repository. So far, we have succeeded in not
doing that in other projects but we will need to do that in this
project.

In the Windows world, it is not unusual to use third-party libraries
which are only available in binary form. Getting them as source is not
an option because the companies developing them are not selling the
source. Moving from those binary-only dependencies to something else
is not an option either because what we are using has some unique
features, be it technical features or support features. In our
project, we have about a dozen such binaries, ranging from a few
hundred kilobytes, to a couple hundred megabytes (proprietary database
and virtualization engine).

The usual answer to the "I need to put binaries in the repository"
question has been "no, you do not". Well, we do. We are in heavy
development now, therefore today's version may depend on a certain
version of a third-party shared library (DLL) which we only can get in
binary form, and tomorrow's version may depend on the next version of
that library, and you cannot mix today's source with yesterday's
third-party DLL. I.e. to be able to use the code from 7 days ago at
11.07 AM you need "git checkout" to "return" our source AND the
binaries we were using back then. This is something ClearCase manages
satisfactorily.

I have read about:
- submodules + using different repositories once one "blob repository"
grows too much. This will be probably rejected because it is quite
contrived.
- git-annex (does not get the files in when cloning, pulling, checking
out; you need to do it manually)
- git-media (same as git-annex)
- boar (no, we do not want to use a VCS for binaries in addition to git)
- and a few more

So far the only good solution seems to be git-bigfiles but it's still
in development.

Is there any good solution for my use case, where version = sources
version + binaries version?

Thank you.

If we succeed with git here, the whole corporation (150,000+
employees, Fortune 500) may start to move to git in a year. Many
people are fed up with CC there.

-- 
Pau Garcia i Quiles
http://www.elpauer.org
(Due to my workload, I may need 10 days to answer)


* Re: blobs (once more)
  2011-04-06  8:09 blobs (once more) Pau Garcia i Quiles
@ 2011-04-06  9:25 ` Johannes Schindelin
  2011-04-06 12:20   ` Michael J Gruber
  2011-04-06 14:14   ` Martin Langhoff
  2011-04-06 11:06 ` Matthieu Moy
  2011-04-07  5:20 ` Miles Bader
  2 siblings, 2 replies; 9+ messages in thread
From: Johannes Schindelin @ 2011-04-06  9:25 UTC (permalink / raw)
  To: Pau Garcia i Quiles; +Cc: Git Mailing List

Hi,

On Wed, 6 Apr 2011, Pau Garcia i Quiles wrote:

> Binary large objects. I know it has been discussed once and again but 
> I'd like to know if there is something new.
> 
> Some corporation hired the company I work for one year ago to develop a 
> large application. They imposed ClearCase as the VCS. I don't know if 
> you have used it but it is a pain in the ass. We have lost weeks of 
> development to site-replication problems, funny merges, etc. We are 
> trying to migrate our project to git, which we have experience with.
> 
> One very important point in this project (which is Windows only) is 
> putting binaries in the repository. So far, we have succeeded in not
> doing that in other projects but we will need to do that in this 
> project.
> 
> In the Windows world, it is not unusual to use third-party libraries 
> which are only available in binary form. Getting them as source is not 
> an option because the companies developing them are not selling the 
> source. Moving from those binary-only dependencies to something else is 
> not an option either because what we are using has some unique features, 
> be it technical features or support features. In our project, we have 
> about a dozen such binaries, ranging from a few hundred kilobytes, to a 
> couple hundred megabytes (proprietary database and virtualization 
> engine).
> 
> The usual answer to the "I need to put binaries in the repository" 
> question has been "no, you do not". Well, we do. We are in heavy 
> development now, therefore today's version may depend on a certain 
> version of a third-party shared library (DLL) which we only can get in 
> binary form, and tomorrow's version may depend on the next version of 
> that library, and you cannot mix today's source with yesterday's 
> third-party DLL. I.e. to be able to use the code from 7 days ago at
> 11.07 AM you need "git checkout" to "return" our source AND the binaries 
> we were using back then. This is something ClearCase manages 
> satisfactorily.

I understand. The problem in your case might not be too bad, after all. 
The problem only arises when you have big files that are compressed. If 
you check in multiple versions of an uncompressed .dll file, Git will 
usually do a very good job at compressing them.
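
A rough way to see the effect, as a sketch only: the Python below does not use Git at all. zlib's sliding window stands in for the delta compression Git applies inside packfiles, and the 4 KiB random buffer is a hypothetical stand-in for an uncompressed DLL. It only illustrates that a second, nearly identical version costs little once the first one is stored.

```python
import random
import zlib


def packed_size(*versions: bytes) -> int:
    """Size in bytes of all versions compressed together as one stream."""
    return len(zlib.compress(b"".join(versions)))


# Hypothetical stand-in for an uncompressed DLL: 4 KiB of arbitrary bytes.
rng = random.Random(0)
v1 = bytes(rng.getrandbits(8) for _ in range(4096))

# "Next release": identical except for a four-byte patch in the middle.
v2 = v1[:2000] + b"\x00\x01\x02\x03" + v1[2004:]

one = packed_size(v1)
both = packed_size(v1, v2)
print(f"one version: {one} bytes, two versions together: {both} bytes")
```

The second version is found almost entirely as back-references into the first, so the combined size stays close to the size of one version. Files that are already compressed or encrypted tend to lose this similarity between releases, which is the case described above as the real problem.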

If they are compressed, what you probably need is something like a sparse 
clone, which is sort of available in the form of shallow clones, but it is 
too limited still.

Having said that, in another company I work for, they have 20G repositories
and they will grow larger. That is something they incurred due to 
historical reasons, and they are willing to pay the price in terms of disk 
space. Due to Git's distributed nature, they had no problems with cloning; 
they just use a local reference upon initial clone.
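
The "local reference" is presumably `git clone --reference`, which lets a new clone borrow objects from a repository that is already on disk instead of transferring them again. A minimal sketch of building such a command, where the URL, the mirror path and the helper name are all hypothetical:

```python
import shlex
from typing import List


def reference_clone_command(url: str, local_mirror: str, dest: str) -> List[str]:
    """Build a `git clone` invocation that borrows objects from a local repository."""
    return ["git", "clone", "--reference", local_mirror, url, dest]


cmd = reference_clone_command(
    "ssh://git.example.com/big-project.git",  # hypothetical remote
    "/srv/mirrors/big-project.git",           # hypothetical local mirror
    "big-project",
)
print(shlex.join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)`. Note that a clone made this way keeps depending on the reference repository for the borrowed objects, so the mirror must not be deleted or pruned carelessly.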

> I have read about:
> - submodules + using different repositories once one "blob repository"  
>   grows too much. This will be probably rejected because it is quite 
>   contrived.

I would also recommend against this, because submodules are a very weak 
part of Git.

> - git-annex (does not get the files in when cloning, pulling, checking 
>   out; you need to do it manually)
> - git-media (same as git-annex)

Yes, this is an option, but a bit klunky.

> - boar (no, we do not want to use a VCS for binaries in addition to git)

I did not know about that.

> - and a few more
> 
> So far the only good solution seems to be git-bigfiles but it's still
> in development.

It has stalled, apparently, but I wanted to have a look at it anyway. Will 
let you know of my findings!

> Is there any good solution for my use case, where version = sources 
> version + binaries version?
> 
> Thank you.
> 
> If we succeed with git here, the whole corporation (150,000+
> employees, Fortune 500) may start to move to git in a year. Many
> people are fed up with CC there.

Ciao,
Johannes


* Re: blobs (once more)
  2011-04-06  8:09 blobs (once more) Pau Garcia i Quiles
  2011-04-06  9:25 ` Johannes Schindelin
@ 2011-04-06 11:06 ` Matthieu Moy
  2011-04-06 11:12   ` Peter Jönsson P
  2011-04-07  5:20 ` Miles Bader
  2 siblings, 1 reply; 9+ messages in thread
From: Matthieu Moy @ 2011-04-06 11:06 UTC (permalink / raw)
  To: Pau Garcia i Quiles; +Cc: Git Mailing List, Johannes Schindelin

Pau Garcia i Quiles <pgquiles@elpauer.org> writes:

> I have read about:
> - submodules + using different repositories once one "blob repository"
> grows too much. This will be probably rejected because it is quite
> contrived.
> - git-annex (does not get the files in when cloning, pulling, checking
> out; you need to do it manually)
> - git-media (same as git-annex)
> - boar (no, we do not want to use a VCS for binaries in addition to git)
> - and a few more
>
> So far the only good solution seems to be git-bigfiles but it's still
> in development.

This seems to be the hot topic of the moment in Git ;-). Look at the
mailing-list's archive, like

http://thread.gmane.org/gmane.comp.version-control.git/170649
https://git.wiki.kernel.org/index.php/SoC2011Ideas#Better_big-file_support

there may be a GSoC on the topic. I don't think there's a really good
solution for you right now (but I didn't really look closely), but the
situation is improving.

-- 
Matthieu Moy
http://www-verimag.imag.fr/~moy/


* RE: blobs (once more)
  2011-04-06 11:06 ` Matthieu Moy
@ 2011-04-06 11:12   ` Peter Jönsson P
  2011-04-06 16:42     ` Magnus Bäck
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Jönsson P @ 2011-04-06 11:12 UTC (permalink / raw)
  To: Matthieu Moy, Pau Garcia i Quiles
  Cc: Git Mailing List, Johannes Schindelin, Andrey Devyatkin

Hi!

How about using Google's repo-tool instead of submodules? Is that a better solution if one really needs to keep binaries together (in some way) with the source code?

We are starting to prototype Git and since we need to distribute cross-compilers with the code it would be nice to keep it in a separate repo. Currently we are _heavy_ ClearCase users (oh the horror) with many bad old habits that we are trying to break :)

// Peter 



* Re: blobs (once more)
  2011-04-06  9:25 ` Johannes Schindelin
@ 2011-04-06 12:20   ` Michael J Gruber
  2011-04-06 14:14   ` Martin Langhoff
  1 sibling, 0 replies; 9+ messages in thread
From: Michael J Gruber @ 2011-04-06 12:20 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Pau Garcia i Quiles, Git Mailing List

Johannes Schindelin venit, vidit, dixit 06.04.2011 11:25:
> Hi,
> 
> On Wed, 6 Apr 2011, Pau Garcia i Quiles wrote:
> 
>> Binary large objects. I know it has been discussed once and again but 
>> I'd like to know if there is something new.
>>
>> Some corporation hired the company I work for one year ago to develop a 
>> large application. They imposed ClearCase as the VCS. I don't know if 
>> you have used it but it is a pain in the ass. We have lost weeks of 
>> development to site-replication problems, funny merges, etc. We are 
>> trying to migrate our project to git, which we have experience with.
>>
>> One very important point in this project (which is Windows only) is 
>> putting binaries in the repository. So far, we have succeeded in not
>> doing that in other projects but we will need to do that in this 
>> project.
>>
>> In the Windows world, it is not unusual to use third-party libraries 
>> which are only available in binary form. Getting them as source is not 
>> an option because the companies developing them are not selling the 
>> source. Moving from those binary-only dependencies to something else is 
>> not an option either because what we are using has some unique features, 
>> be it technical features or support features. In our project, we have 
>> about a dozen such binaries, ranging from a few hundred kilobytes, to a 
>> couple hundred megabytes (proprietary database and virtualization 
>> engine).
>>
>> The usual answer to the "I need to put binaries in the repository" 
>> question has been "no, you do not". Well, we do. We are in heavy 
>> development now, therefore today's version may depend on a certain 
>> version of a third-party shared library (DLL) which we only can get in 
>> binary form, and tomorrow's version may depend on the next version of 
>> that library, and you cannot mix today's source with yesterday's 
>> third-party DLL. I.e. to be able to use the code from 7 days ago at
>> 11.07 AM you need "git checkout" to "return" our source AND the binaries 
>> we were using back then. This is something ClearCase manages 
>> satisfactorily.
> 
> I understand. The problem in your case might not be too bad, after all. 
> The problem only arises when you have big files that are compressed. If 
> you check in multiple versions of an uncompressed .dll file, Git will 
> usually do a very good job at compressing them.
> 
> If they are compressed, what you probably need is something like a sparse 
> clone, which is sort of available in the form of shallow clones, but it is 
> too limited still.
> 
> Having said that, in another company I work for, they have 20G repositories
> and they will grow larger. That is something they incurred due to 
> historical reasons, and they are willing to pay the price in terms of disk 
> space. Due to Git's distributed nature, they had no problems with cloning; 
> they just use a local reference upon initial clone.
> 
>> I have read about:
>> - submodules + using different repositories once one "blob repository"  
>>   grows too much. This will be probably rejected because it is quite 
>>   contrived.
> 
> I would also recommend against this, because submodules are a very weak 
> part of Git.
> 
>> - git-annex (does not get the files in when cloning, pulling, checking 
>>   out; you need to do it manually)
>> - git-media (same as git-annex)
> 
> Yes, this is an option, but a bit klunky.
> 
>> - boar (no, we do not want to use a VCS for binaries in addition to git)
> 
> I did not know about that.
> 
>> - and a few more
>>
>> So far the only good solution seems to be git-bigfiles but it's still
>> in development.
> 
> It has stalled, apparently, but I wanted to have a look at it anyway. Will 
> let you know of my findings!

I think in many applications the "download-on-demand" approach which
git-annex takes is very important. (I don't know how far our
sparse/shallow supports this.) Also, their remote backends look
interesting. And no, I don't want Haskell as yet another language for
our code base.

Fedora handles big files (compressed tar balls) in git with a file
store, scripting (fedpkg) and tracking only a text file with hash values
("sources") in git; somehow a baby version of git-annex.

The symlink based approach of annex (big file is a symlink to the
"object store" which is indexed by blob content sha1) reminds me very
much of our notes trees and the way textconv-cache uses it. It feels as
if we already have all the pieces in place. (I don't think we need to
track big files' contents, only their hashes; this is fast for read-only
media, see annex' worm-backend.)
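
To illustrate the symlink idea, here is a minimal sketch under stated assumptions. It hashes the raw file content with SHA-1 (not Git's blob id, which also covers a header), and the store layout and the name `annex_file` are hypothetical; git-annex's real key format and directory structure differ.

```python
import hashlib
import os
import tempfile


def annex_file(path: str, store_dir: str) -> str:
    """Move `path` into a content-addressed store and leave a symlink behind.

    Returns the hex SHA-1 of the content, which is also the file's name
    inside the store.
    """
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha1.update(chunk)
    digest = sha1.hexdigest()

    os.makedirs(store_dir, exist_ok=True)
    stored = os.path.join(store_dir, digest)
    if os.path.exists(stored):
        os.remove(path)            # identical content is already in the store
    else:
        os.replace(path, stored)   # first copy: move it into the store
    os.symlink(os.path.abspath(stored), path)
    return digest


with tempfile.TemporaryDirectory() as work:
    big = os.path.join(work, "engine.dll")   # hypothetical big file
    with open(big, "wb") as f:
        f.write(b"pretend this is 200 MB")
    digest = annex_file(big, os.path.join(work, "store"))
    is_link = os.path.islink(big)
    with open(big, "rb") as f:
        roundtrip = f.read()
```

Only the symlink would be tracked by git; the store can live anywhere and be filled on demand. Creating symlinks on Windows needs extra privileges, which matters for a Windows-only project.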

Another crazy idea would be to "git replace" big files by place-holders
(blob with the big file's sha1 as content) or rather the other way
round, but I haven't thought this through.

Michael


* Re: blobs (once more)
  2011-04-06  9:25 ` Johannes Schindelin
  2011-04-06 12:20   ` Michael J Gruber
@ 2011-04-06 14:14   ` Martin Langhoff
  1 sibling, 0 replies; 9+ messages in thread
From: Martin Langhoff @ 2011-04-06 14:14 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Pau Garcia i Quiles, Git Mailing List

On Wed, Apr 6, 2011 at 5:25 AM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
> I understand. The problem in your case might not be too bad, after all.
> The problem only arises when you have big files that are compressed. If
> you check in multiple versions of an uncompressed .dll file, Git will
> usually do a very good job at compressing them.

Except when they are very large; in that case git tends to OOM. But
just yesterday Junio posted a proposed patch to honour a max file size
for compression (search the archive for 'Git exhausts memory' and
'core.bigFileThreshold').

So Pau might be in luck with current git + Junio's patch + enough RAM
on the workstations. Pau, I definitely suggest you try it out.

If it still consumes too much memory with the largest files (i.e. the
VM images you mention), the fedpkg approach (discussed in this thread)
is good too. It's a Python wrapper around git, which tracks a text
file listing the hashes of the large files, and fetches them (or
uploads them) via SCP. You only need to use it when dealing with the
large files -- most of the time you're just using git.
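
The core of that approach can be sketched in a few lines. This is not fedpkg's code: the "digest, two spaces, file name" line format, the choice of SHA-256 and the function names are assumptions made for the illustration, and the SCP transfer is left out.

```python
import hashlib
import os
import tempfile
from typing import Iterable, List


def file_digest(path: str) -> str:
    """Hex SHA-256 of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def write_sources(paths: Iterable[str], sources_file: str) -> None:
    """Write one '<digest>  <name>' line per big file; this file is what git tracks."""
    with open(sources_file, "w", encoding="utf-8") as out:
        for path in paths:
            out.write(f"{file_digest(path)}  {os.path.basename(path)}\n")


def stale_files(sources_file: str, directory: str) -> List[str]:
    """Names listed in the sources file that are missing locally or have other content."""
    stale = []
    with open(sources_file, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            digest, name = line.split(maxsplit=1)
            name = name.strip()
            local = os.path.join(directory, name)
            if not os.path.isfile(local) or file_digest(local) != digest:
                stale.append(name)
    return stale


with tempfile.TemporaryDirectory() as work:
    dll = os.path.join(work, "vendor.dll")   # hypothetical third-party binary
    with open(dll, "wb") as f:
        f.write(b"release 1")
    sources = os.path.join(work, "sources")
    write_sources([dll], sources)
    before = stale_files(sources, work)
    with open(dll, "wb") as f:
        f.write(b"release 2")                # local file no longer matches
    after = stale_files(sources, work)
```

After a `git checkout` of an older commit, whatever `stale_files` reports is what the wrapper would have to fetch from the file store.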

The fedpkg code is quite readable, and I've already "stolen" some of
its code for my local needs. Recommended.

cheers,


m
-- 
 martin.langhoff@gmail.com
 martin@laptop.org -- Software Architect - OLPC
 - ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff


* Re: blobs (once more)
  2011-04-06 11:12   ` Peter Jönsson P
@ 2011-04-06 16:42     ` Magnus Bäck
  0 siblings, 0 replies; 9+ messages in thread
From: Magnus Bäck @ 2011-04-06 16:42 UTC (permalink / raw)
  To: Peter Jönsson P
  Cc: Matthieu Moy, Pau Garcia i Quiles, Git Mailing List,
	Johannes Schindelin, Andrey Devyatkin

On Wednesday, April 06, 2011 at 13:12 CEST,
     Peter Jönsson P <peter.p.jonsson@ericsson.com> wrote:

> How about using Google's repo-tool instead of submodules? Is that
> a better solution if one really needs to keep binaries together
> (in some way) with the source code?

I don't know if Repo vs. submodules is the most interesting choice;
they're both relying on plain gits for the storage. We stored large
binaries in gits for a while too, but for us the major hassle was
that it was killing our servers. When too many of the few hundred
developers were fetching multi-GB repos the 8-16 core 24 GB RAM server
(don't know the exact specs at the time) was really down on its knees.
Now, we were using JGit (via Gerrit) and I'm sure tuning could've helped
-- as well as just replicating the data and spreading the load across
multiple servers -- but don't forget this factor when choosing the tool.

> We are starting to prototype Git and since we need to distribute
> cross-compilers with the code it would be nice to keep it in a
> separate repo. Currently we are _heavy_ ClearCase users (oh the
> horror) with many bad old habits that we are trying to break :)

We've been there too.

-- 
Magnus Bäck                   Opinions are my own and do not necessarily
SW Configuration Manager      represent the ones of my employer, etc.
Sony Ericsson


* Re: blobs (once more)
  2011-04-06  8:09 blobs (once more) Pau Garcia i Quiles
  2011-04-06  9:25 ` Johannes Schindelin
  2011-04-06 11:06 ` Matthieu Moy
@ 2011-04-07  5:20 ` Miles Bader
  2011-04-07  6:45   ` Johannes Schindelin
  2 siblings, 1 reply; 9+ messages in thread
From: Miles Bader @ 2011-04-07  5:20 UTC (permalink / raw)
  To: Pau Garcia i Quiles; +Cc: Git Mailing List, Johannes Schindelin

Pau Garcia i Quiles <pgquiles@elpauer.org> writes:
> The usual answer to the "I need to put binaries in the repository"
> question has been "no, you do not". Well, we do. We are in heavy
> development now, therefore today's version may depend on a certain
> version of a third-party shared library (DLL) which we only can get in
> binary form, and tomorrow's version may depend on the next version of
> that library, and you cannot mix today's source with yesterday's
> third-party DLL. I.e. to be able to use the code from 7 days ago at
> 11.07 AM you need "git checkout" to "return" our source AND the
> binaries we were using back then. This is something ClearCase manages
> satisfactorily.

If it were me, I'd just store the huge binaries in some sort of separate
remote filesystem, and then store the remote-file-system _paths_ to them
in git (in a simple text file).

Then either use the build system or some sort of git filter to make sure
that the actual library was installed before building based on the path
read from the file in git.
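
As a sketch of what such a build step might look like: the manifest name `binaries.txt`, its one-path-per-line format, and the function `install_binaries` are all hypothetical. The remote file system is assumed to be mounted as a normal directory.

```python
import os
import shutil
import tempfile
from typing import List


def install_binaries(manifest: str, remote_root: str, dest_dir: str) -> List[str]:
    """Copy each path listed in `manifest` from `remote_root` into `dest_dir`.

    Files already present under the same relative path are skipped.
    Returns the relative paths that were actually copied.
    """
    copied = []
    with open(manifest, encoding="utf-8") as f:
        for line in f:
            rel = line.strip()
            if not rel or rel.startswith("#"):
                continue
            target = os.path.join(dest_dir, rel)
            if not os.path.exists(target):
                os.makedirs(os.path.dirname(target), exist_ok=True)
                shutil.copy2(os.path.join(remote_root, rel), target)
                copied.append(rel)
    return copied


with tempfile.TemporaryDirectory() as work:
    remote = os.path.join(work, "fileserver")   # stands in for the remote mount
    rel = "dbengine/4.2/engine.dll"             # hypothetical versioned path
    os.makedirs(os.path.join(remote, os.path.dirname(rel)))
    with open(os.path.join(remote, rel), "wb") as f:
        f.write(b"engine 4.2")

    manifest = os.path.join(work, "binaries.txt")
    with open(manifest, "w", encoding="utf-8") as f:
        f.write("# third-party binaries\n" + rel + "\n")

    libs = os.path.join(work, "libs")
    first = install_binaries(manifest, remote, libs)
    second = install_binaries(manifest, remote, libs)
    with open(os.path.join(libs, rel), "rb") as f:
        installed = f.read()
```

Keeping the version in the path means that checking out an older commit changes the manifest and therefore which file the build picks up. Nothing here verifies content, so the scheme trusts the file server not to change a file behind a given path.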

[This would be a pain as a _general_ solution (for git), because it
involves coordination with the remote file system, etc, but for an
organization like yours setting up a system for a specific product, it
should be fairly easy to set up and maintain -- and particularly so if
the main use is to store 3rd party library releases, as they're
typically not going to be something that anybody will want to checkin,
but rather installed by a small set of people.]

-Miles

-- 
Circus, n. A place where horses, ponies and elephants are permitted to see
men, women and children acting the fool.


* Re: blobs (once more)
  2011-04-07  5:20 ` Miles Bader
@ 2011-04-07  6:45   ` Johannes Schindelin
  0 siblings, 0 replies; 9+ messages in thread
From: Johannes Schindelin @ 2011-04-07  6:45 UTC (permalink / raw)
  To: Miles Bader; +Cc: Pau Garcia i Quiles, Git Mailing List

Hi,

On Thu, 7 Apr 2011, Miles Bader wrote:

> Pau Garcia i Quiles <pgquiles@elpauer.org> writes:
> > The usual answer to the "I need to put binaries in the repository" 
> > question has been "no, you do not". Well, we do. We are in heavy 
> > development now, therefore today's version may depend on a certain 
> > version of a third-party shared library (DLL) which we only can get in 
> > binary form, and tomorrow's version may depend on the next version of 
> > that library, and you cannot mix today's source with yesterday's 
> > third-party DLL. I.e. to be able to use the code from 7 days ago at
> > 11.07 AM you need "git checkout" to "return" our source AND the 
> > binaries we were using back then. This is something ClearCase manages 
> > satisfactorily.
> 
> If it were me, I'd just store the huge binaries in some sort of separate 
> remote filesystem, and then store the remote-file-system _paths_ to them 
> in git (in a simple text file).

That fails for a number of reasons:

- it does not pass the 30,000-feet-high test

- integrity is not guaranteed (anybody can edit the files on the remote 
  file system, and nobody would realize that a "git checkout HEAD~2000" 
  ends up being something different from before)

- you would have to reinvent an efficient transfer (e.g. taking into 
  account all the data we have already)

- storage is no longer efficient, especially if you have multiple versions 
  of the same file.

- it is no longer decentralized. Just think about yourself sitting
  in the middle of Antarctica, desperately needing to match a penguin
  against a database of known penguins. You definitely want to have the 
  database local instead of leeching it down the non-existing wire all the 
  time. Likewise, if you and your group sit, say, on Viti Levu, and 
  develop software with people from New York, Texas, you definitely want
  a repository-in-the-middle, making it one person's duty to synchronize, 
  say, once per day.

I am sure you can think of more reasons.

Ciao,
Johannes


end of thread, other threads:[~2011-04-07  6:46 UTC | newest]

Thread overview: 9+ messages
2011-04-06  8:09 blobs (once more) Pau Garcia i Quiles
2011-04-06  9:25 ` Johannes Schindelin
2011-04-06 12:20   ` Michael J Gruber
2011-04-06 14:14   ` Martin Langhoff
2011-04-06 11:06 ` Matthieu Moy
2011-04-06 11:12   ` Peter Jönsson P
2011-04-06 16:42     ` Magnus Bäck
2011-04-07  5:20 ` Miles Bader
2011-04-07  6:45   ` Johannes Schindelin
