git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Better big file support & GSoC
@ 2011-04-02 14:40 Jonathan Michalon
  2011-04-02 15:30 ` Carlos Martín Nieto
  2011-04-03  4:00 ` david
  0 siblings, 2 replies; 5+ messages in thread
From: Jonathan Michalon @ 2011-04-02 14:40 UTC (permalink / raw)
  To: git

Hi Git people,

I'm an applicant to the GSoC within git.
I would like to help building a better big file support mechanism.

I have read the latest threads on this topic:
http://thread.gmane.org/gmane.comp.version-control.git/165389/focus=165389
http://thread.gmane.org/gmane.comp.version-control.git/168403/focus=168852

Here's a compilation of what I read and what I think.

What come the most are OOM issues. But I think that the problem is, git tries
to work exactly the same on binaries and text. If we managed one way or another
to skip tasks (what "intelligent" operations are possible on binaries ?
Almost none...) we should be able to avoid them, like most of the time.
This means that a first step will be to introduce an autodetection mechanism.

Jeff King argues that, on binaries, we got uninteresting diffs, and compression
is often useless. I agree. We would better not compress any of them (okay, tons
of zeros would compress well but who's going to track zeroes?).

Eric Montellese says: "Don't track binaries in git. Track their hashes." I agree
here too. We should not treat computer data like source code (or whatever text).
He claims that he needs to handle repos containing source code + zipped tarballs
+ large and/or many binaries. Users seem to really need binary tracking and
therefore git should do it. I personally needed to a couple of times.

He also says that we could want to do download-as-needed and remove-unnecessary
operations, and I think that it may be clean enough to add a git command like
'git blob' to handle special operations for binaries. Perhaps in a second step.

Another idea was to create "sparse" repos, considered leafs as they may not be
cloned from because they lack full data. But it may or may not be in the
spirit of Git...


What I personally would like as a feature is the ability to store the main
repo with sources etc. into a conventional repo but put the data elsewhere
on a storage location. This would allow to develop programs which need data
to run (like textures in games etc.) without making the repo slow, big or
just messy.
I faced the situation on TuxFamily where the website, Git/SVN etc. are on one
quick server and the download area on another one. The first was limited to
something like 100MB and the second to 1GB, extensible to more if needed.
On the same idea, on my home server with multiple OpenVZ containers I host repos
for my projects on one free-to-access container which may be attacked, or even
compromised which has a small disk partition. On the other side my data is on a
ssh-only, secured, firewalled big partition. It may be useful to have code on
the first but ssh'd data on the other.
I suspect many other situations where a separation between code and data may
help administrators and/or users.
To handle this I thought of a mechanism allowing a sort of branch (to make use
of multiple 'remote') to be checked out at the same time as the code...
In addition we should use an extensible protocol layer to manage data.
git-annex or git-media which already address some of the problems here
are using various things like ssh, http, s3. And I just saw that Debian's git
package already recommend rsync.


What do you think about that whole? Would it fit on a GSoC background? Great
interesting task indeed. May sound too long. But of course if the summer went
too short I would not drop the project on the floor as soon as the time limit
will be reached.


Best regards,

--
Jonathan Michalon

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2011-04-04 16:53 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-04-02 14:40 Better big file support & GSoC Jonathan Michalon
2011-04-02 15:30 ` Carlos Martín Nieto
2011-04-04 16:53   ` Jonathan Michalon
2011-04-03  4:00 ` david
2011-04-04 16:52   ` Jonathan Michalon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).