Re: blobs (once more) - Michael J Gruber

From: Michael J Gruber <git@drmicha.warpmail.net>
To: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Cc: Pau Garcia i Quiles <pgquiles@elpauer.org>,
	Git Mailing List <git@vger.kernel.org>
Subject: Re: blobs (once more)
Date: Wed, 06 Apr 2011 14:20:09 +0200
Message-ID: <4D9C5A79.30207@drmicha.warpmail.net>
In-Reply-To: <alpine.DEB.1.00.1104061121000.2040@bonsai2>

Johannes Schindelin venit, vidit, dixit 06.04.2011 11:25:
> Hi,
> 
> On Wed, 6 Apr 2011, Pau Garcia i Quiles wrote:
> 
>> Binary large objects. I know it has been discussed once and again but 
>> I'd like to know if there is something new.
>>
>> Some corporation hired the company I work for one year ago to develop a 
>> large application. They imposed ClearCase as the VCS. I don't know if 
>> you have used it but it is a pain in the ass. We have lost weeks of 
>> development to site-replication problems, funny merges, etc. We are 
>> trying to migrate our project to git, which we have experience with.
>>
>> One very important point in this project (which is Windows only) is 
>> putting binaries in the repository. So far, we have succeeded in not 
>> doing that in other projects but we will need to do that in this 
>> project.
>>
>> In the Windows world, it is not unusual to use third-party libraries 
>> which are only available in binary form. Getting them as source is not 
>> an option because the companies developing them are not selling the 
>> source. Moving from those binary-only dependencies to something else is 
>> not an option either because what we are using has some unique features, 
>> be it technical features or support features. In our project, we have 
>> about a dozen such binaries, ranging from a few hundred kilobytes, to a 
>> couple hundred megabytes (proprietary database and virtualization 
>> engine).
>>
>> The usual answer to the "I need to put binaries in the repository" 
>> question has been "no, you do not". Well, we do. We are in heavy 
>> development now, therefore today's version may depend on a certain 
>> version of a third-party shared library (DLL) which we only can get in 
>> binary form, and tomorrow's version may depend on the next version of 
>> that library, and you cannot mix today's source with yesterday's 
>> third-party DLL. I.e., to be able to use the code from 7 days ago at 
>> 11.07 AM you need "git checkout" to "return" our source AND the binaries 
>> we were using back then. This is something ClearCase manages 
>> satisfactorily.
> 
> I understand. The problem in your case might not be too bad, after all. 
> The problem only arises when you have big files that are compressed. If 
> you check in multiple versions of an uncompressed .dll file, Git will 
> usually do a very good job at compressing them.
> 
> If they are compressed, what you probably need is something like a sparse 
> clone, which is sort of available in the form of shallow clones, but it is 
> too limited still.
> 
> Having said that, in another company I work for, they have 20G repositories 
> and they will grow larger. That is something they incurred due to 
> historical reasons, and they are willing to pay the price in terms of disk 
> space. Due to Git's distributed nature, they had no problems with cloning; 
> they just use a local reference upon initial clone.
> 
>> I have read about:
>> - submodules + using different repositories once one "blob repository"  
>>   grows too much. This will be probably rejected because it is quite 
>>   contrived.
> 
> I would also recommend against this, because submodules are a very weak 
> part of Git.
> 
>> - git-annex (does not get the files in when cloning, pulling, checking 
>>   out; you need to do it manually)
>> - git-media (same as git-annex)
> 
> Yes, this is an option, but a bit klunky.
> 
>> - boar (no, we do not want to use a VCS for binaries in addition to git)
> 
> I did not know about that.
> 
>> - and a few more
>>
>> So far the only good solution seems to be git-bigfiles but it's still
>> in development.
> 
> It has stalled, apparently, but I wanted to have a look at it anyway. Will 
> let you know of my findings!

I think in many applications the "download-on-demand" approach which
git-annex takes is very important. (I don't know how far our
sparse/shallow support covers this.) Also, their remote backends look
interesting. And no, I don't want Haskell as yet another language for
our code base.

Fedora handles big files (compressed tarballs) in git with a file
store, scripting (fedpkg) and tracking only a text file with hash values
("sources") in git; in a way, a baby version of git-annex.
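
A minimal sketch of that pattern, to make the idea concrete: git tracks
only a small manifest, and a script checks which big files still have to
be fetched from the file store. The manifest format ("<sha256>  <name>"
per line) and the function names are assumptions made for illustration;
this is not fedpkg's actual format or code.

```python
import hashlib
from pathlib import Path


def file_digest(path, algo="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def parse_manifest(text):
    """Parse lines of the (assumed) form '<hexdigest>  <filename>'."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        digest, name = line.split(None, 1)
        entries[name] = digest
    return entries


def missing_or_stale(manifest_text, directory):
    """Return the names whose file is absent or does not match its hash."""
    directory = Path(directory)
    result = []
    for name, digest in parse_manifest(manifest_text).items():
        path = directory / name
        if not path.is_file() or file_digest(path) != digest:
            result.append(name)
    return result
```

Whatever missing_or_stale() returns would then be downloaded from the
file store by the wrapper script; only the manifest ever changes in git.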

The symlink-based approach of annex (big file is a symlink to the
"object store" which is indexed by blob content sha1) reminds me very
much of our notes trees and the way textconv-cache uses them. It feels as
if we already have all the pieces in place. (I don't think we need to
track big files' contents, only their hashes; this is fast for read-only
media, see annex's WORM backend.)
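
The symlink idea can be sketched as follows. This is only an
illustration under stated assumptions: the store is a flat directory
keyed by the SHA-1 of the raw file content, and annex_file() is a
hypothetical helper name. It does not reproduce git-annex's real key
format or directory layout.

```python
import hashlib
import os
import shutil
from pathlib import Path


def annex_file(path, store):
    """Move `path` into `store` under its content SHA-1 and leave a
    symlink behind. Returns the hex digest used as the store key."""
    path = Path(path)
    store = Path(store)
    store.mkdir(parents=True, exist_ok=True)

    # Hash the content in chunks; the digest becomes the store key.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    digest = h.hexdigest()

    target = store / digest
    if target.exists():
        # Same content is already stored; drop the duplicate copy.
        path.unlink()
    else:
        shutil.move(str(path), str(target))

    # Git would then track only this small symlink, not the content.
    os.symlink(str(target.resolve()), str(path))
    return digest
```

Since the link target encodes the hash, two files with identical content
end up pointing at the same stored object.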

Another crazy idea would be to "git replace" big files by place-holders
(blob with the big file's sha1 as content) or rather the other way
round, but I haven't thought this through.
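
To illustrate what such a place-holder could look like, here is a rough
sketch. The blob ID computation (SHA-1 over "blob <size>\0" plus the
content) is how git names blobs; everything else is an assumption: the
function names are hypothetical, and the place-holder content (the hex
ID plus a newline) is just one possible choice.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Object ID git assigns to a blob with this content (SHA-1 repos)."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def make_placeholder(big_content: bytes) -> bytes:
    """Place-holder blob content: the big blob's ID followed by a newline
    (assumed format, not something git defines)."""
    return (git_blob_id(big_content) + "\n").encode("ascii")


def replacement_pair(big_content: bytes):
    """Return (big blob ID, place-holder blob ID), i.e. the two object
    names one would hand to `git replace` in one order or the other."""
    placeholder = make_placeholder(big_content)
    return git_blob_id(big_content), git_blob_id(placeholder)
```

Both objects would still have to exist in the object database (e.g. via
git hash-object -w) before git replace can map one to the other.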

Michael
