Better big file support & GSoC

* Better big file support & GSoC
@ 2011-04-02 14:40 Jonathan Michalon
  2011-04-02 15:30 ` Carlos Martín Nieto
  2011-04-03  4:00 ` david
  0 siblings, 2 replies; 5+ messages in thread
From: Jonathan Michalon @ 2011-04-02 14:40 UTC (permalink / raw)
  To: git

Hi Git people,

I'm applying to GSoC with Git.
I would like to help build a better big file support mechanism.

I have read the latest threads on this topic:
http://thread.gmane.org/gmane.comp.version-control.git/165389/focus=165389
http://thread.gmane.org/gmane.comp.version-control.git/168403/focus=168852

Here's a compilation of what I read and what I think.

What comes up most often are OOM issues. But I think the underlying problem is
that git tries to work exactly the same way on binaries and on text. If we
managed, one way or another, to skip the tasks that make no sense on binaries
(what "intelligent" operations are possible on binaries? Almost none...), we
should be able to avoid these issues most of the time.
This means that a first step would be to introduce an autodetection mechanism.
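
One possible shape for such an autodetection step is sketched below. To my
knowledge git's diff code already uses a heuristic of this kind (a NUL byte in
the first few kilobytes marks the content as binary); the function names and
the 8000-byte limit used here are assumptions made for illustration, not git's
actual API.

```python
FIRST_FEW_BYTES = 8000  # assumed inspection window, for illustration only


def looks_binary(data: bytes, limit: int = FIRST_FEW_BYTES) -> bool:
    """Return True if a NUL byte occurs within the first `limit` bytes."""
    return b"\0" in data[:limit]


def looks_binary_file(path: str, limit: int = FIRST_FEW_BYTES) -> bool:
    """Same check, but read only the head of the file, never the whole file."""
    with open(path, "rb") as f:
        return looks_binary(f.read(limit), limit)
```

Because only the head of the file is read, the check itself cannot run out of
memory, whatever the size of the file.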

Jeff King argues that, on binaries, we get uninteresting diffs, and compression
is often useless. I agree. We had better not compress any of them (okay, tons
of zeros would compress well, but who's going to track zeros?).
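
The point about compression can be checked with a small experiment. The sketch
below uses zlib (the library git uses to deflate its objects) and compares a
run of zeros with random bytes; the random bytes only stand in for
already-compressed data such as zipped tarballs, which is an assumption of this
example.

```python
import os
import zlib


def compression_ratio(data: bytes, level: int = 6) -> float:
    """Compressed size divided by original size (lower means better)."""
    if not data:
        return 1.0
    return len(zlib.compress(data, level)) / len(data)


zeros = bytes(1 << 20)        # 1 MiB of zero bytes
noise = os.urandom(1 << 20)   # 1 MiB of random bytes, stand-in for compressed data

print(f"zeros: {compression_ratio(zeros):.4f}")
print(f"noise: {compression_ratio(noise):.4f}")
```

The zeros shrink to a tiny fraction of their size, while the random data does
not shrink at all, so the CPU time spent deflating it is wasted.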

Eric Montellese says: "Don't track binaries in git. Track their hashes." I agree
here too. We should not treat computer data like source code (or whatever text).
He claims that he needs to handle repos containing source code + zipped tarballs
+ large and/or many binaries. Users seem to really need binary tracking, and
therefore git should do it. I have personally needed it a couple of times.
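
Tracking a hash does not require holding the file in memory. As a minimal
sketch, the function below computes the SHA-1 that git assigns to a blob (the
hash of a "blob <size>\0" header followed by the content) while reading the
file in fixed-size chunks. The function name and the chunk size are
hypothetical.

```python
import hashlib
import os


def blob_id(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute git's blob SHA-1 for a file without loading it whole."""
    size = os.path.getsize(path)
    h = hashlib.sha1()
    h.update(b"blob %d\0" % size)  # git hashes this header before the content
    with open(path, "rb") as f:
        # Read until f.read() returns b"" (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Memory use is bounded by chunk_size, which addresses the OOM concern for this
particular step.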

He also says that we might want download-as-needed and remove-unnecessary
operations, and I think it would be clean enough to add a git command like
'git blob' to handle special operations for binaries. Perhaps as a second step.

Another idea was to create "sparse" repos, considered leaves because they could
not be cloned from, as they lack the full data. But it may or may not be in the
spirit of Git...

What I personally would like as a feature is the ability to store the main
repo with sources etc. into a conventional repo but put the data elsewhere
on a storage location. This would make it possible to develop programs which need data
to run (like textures in games etc.) without making the repo slow, big or
just messy.
I faced the situation on TuxFamily where the website, Git/SVN etc. are on one
quick server and the download area on another one. The first was limited to
something like 100MB and the second to 1GB, extensible to more if needed.
Along the same lines, on my home server with multiple OpenVZ containers, I host
the repos for my projects on one free-to-access container, which may be attacked
or even compromised and which has a small disk partition. My data, on the other
hand, is on an ssh-only, secured, firewalled big partition. It may be useful to
have the code on the first and the data, reachable over ssh, on the other.
I suspect there are many other situations where a separation between code and
data may help administrators and/or users.
To handle this, I thought of a mechanism allowing a sort of branch (making use
of multiple remotes) to be checked out at the same time as the code...
In addition, we should use an extensible protocol layer to manage the data.
git-annex and git-media, which already address some of the problems here,
use various transports such as ssh, http and s3. And I just saw that Debian's
git package already recommends rsync.
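
An extensible protocol layer could be as simple as a registry that dispatches on
the URL scheme. The sketch below is purely illustrative: register_backend,
fetch and the "mem" scheme are hypothetical names, and the in-memory backend
only stands in for real transports such as ssh, http, s3 or rsync.

```python
from typing import Callable, Dict
from urllib.parse import urlparse

# A fetcher takes a URL and returns the bytes stored there.
Fetcher = Callable[[str], bytes]

_BACKENDS: Dict[str, Fetcher] = {}


def register_backend(scheme: str, fetcher: Fetcher) -> None:
    """Associate a URL scheme (e.g. "ssh") with a function that fetches data."""
    _BACKENDS[scheme.lower()] = fetcher


def fetch(url: str) -> bytes:
    """Look up the backend for the URL's scheme and delegate to it."""
    scheme = urlparse(url).scheme.lower()
    try:
        backend = _BACKENDS[scheme]
    except KeyError:
        raise ValueError(f"no backend registered for scheme {scheme!r}") from None
    return backend(url)


# In-memory stand-in backend so the sketch runs without any network access.
_MEMORY = {"mem://textures/grass.png": b"fake image bytes"}
register_backend("mem", lambda url: _MEMORY[url])
```

New transports can then be added without touching the code that asks for the
data, which is the property an extensible layer is meant to provide.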

What do you think of all this? Would it fit within the scope of a GSoC project?
It is certainly an interesting task. It may sound too big, but of course if the
summer turned out to be too short, I would not drop the project on the floor as
soon as the time limit is reached.

Best regards,

--
Jonathan Michalon

* Re: Better big file support & GSoC
  2011-04-02 14:40 Better big file support & GSoC Jonathan Michalon
@ 2011-04-02 15:30 ` Carlos Martín Nieto
  2011-04-04 16:53   ` Jonathan Michalon
  2011-04-03  4:00 ` david
  1 sibling, 1 reply; 5+ messages in thread
From: Carlos Martín Nieto @ 2011-04-02 15:30 UTC (permalink / raw)
  To: Jonathan Michalon; +Cc: git

On Sat, Apr 02, 2011 at 04:40:51PM +0200, Jonathan Michalon wrote:
> [...]
> Eric Montellese says: "Don't track binaries in git. Track their hashes." I agree
> here too. We should not treat computer data like source code (or whatever text).
> He claims that he needs to handle repos containing source code + zipped tarballs
> + large and/or many binaries. Users seem to really need binary tracking and
> therefore git should do it. I personally needed to a couple of times.
> 
> He also says that we could want to do download-as-needed and remove-unnecessary
> operations, and I think that it may be clean enough to add a git command like
> 'git blob' to handle special operations for binaries. Perhaps in a second step.
> 
> Another idea was to create "sparse" repos, considered leafs as they may not be
> cloned from because they lack full data. But it may or may not be in the
> spirit of Git...
> 
> 
> What I personally would like as a feature is the ability to store the main
> repo with sources etc. into a conventional repo but put the data elsewhere
> on a storage location. This would allow to develop programs which need data
> to run (like textures in games etc.) without making the repo slow, big or
> just messy.

 This sounds a lot like what git-annex [0] does. Maybe
 integrating its functionality with mainline git could be a good
 start.

[0] http://git-annex.branchable.com/

   cmn
-- 
Carlos Martín Nieto | http://cmartin.tk

* Re: Better big file support & GSoC
  2011-04-02 14:40 Better big file support & GSoC Jonathan Michalon
  2011-04-02 15:30 ` Carlos Martín Nieto
@ 2011-04-03  4:00 ` david
  2011-04-04 16:52   ` Jonathan Michalon
  1 sibling, 1 reply; 5+ messages in thread
From: david @ 2011-04-03  4:00 UTC (permalink / raw)
  To: Jonathan Michalon; +Cc: git

On Sat, 2 Apr 2011, Jonathan Michalon wrote:

> Hi Git people,
>
> I'm an applicant to the GSoC within git.
> I would like to help building a better big file support mechanism.
>
> I have read the latest threads on this topic:
> http://thread.gmane.org/gmane.comp.version-control.git/165389/focus=165389
> http://thread.gmane.org/gmane.comp.version-control.git/168403/focus=168852

there was also an offshoot of a similar discussion that pointed out that 
this could be done pretty cleanly with the clean/smudge hooks.
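
For readers unfamiliar with the mechanism: git can run a "clean" program when a
file is staged and a "smudge" program when it is checked out, configured through
filter.<name>.clean and filter.<name>.smudge and enabled per path with a
filter=<name> attribute in .gitattributes. The sketch below only illustrates
the idea with plain functions; the pointer format and the dictionary used as
external store are assumptions, and a real filter would read stdin and write
stdout instead.

```python
import hashlib
from typing import Dict

POINTER_PREFIX = b"bigfile-sha1:"  # hypothetical pointer format


def clean(content: bytes, store: Dict[str, bytes]) -> bytes:
    """Working tree -> repository: stash the content, return a small pointer."""
    digest = hashlib.sha1(content).hexdigest()
    store[digest] = content
    return POINTER_PREFIX + digest.encode("ascii") + b"\n"


def smudge(pointer: bytes, store: Dict[str, bytes]) -> bytes:
    """Repository -> working tree: turn a pointer back into the content."""
    if not pointer.startswith(POINTER_PREFIX):
        return pointer  # not a pointer, pass it through unchanged
    digest = pointer[len(POINTER_PREFIX):].strip().decode("ascii")
    return store[digest]
```

With this arrangement the repository only ever stores the short pointer, while
the large content lives in whatever the store stands for.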

David Lang

* Re: Better big file support & GSoC
  2011-04-03  4:00 ` david
@ 2011-04-04 16:52   ` Jonathan Michalon
  0 siblings, 0 replies; 5+ messages in thread
From: Jonathan Michalon @ 2011-04-04 16:52 UTC (permalink / raw)
  Cc: git

On Sat, 2 Apr 2011 21:00:53 -0700 (PDT),
david@lang.hm wrote:

> On Sat, 2 Apr 2011, Jonathan Michalon wrote:
> 
> > Hi Git people,
> >
> > I'm an applicant to the GSoC within git.
> > I would like to help building a better big file support mechanism.
> >
> > I have read the latest threads on this topic:
> > http://thread.gmane.org/gmane.comp.version-control.git/165389/focus=165389
> > http://thread.gmane.org/gmane.comp.version-control.git/168403/focus=168852
> 
> there was also an offshoot of a similar discussion that pointed out that 
> this could be done pretty cleanly with the clean/smudge hooks.
> 
> David Lang

Edit:
Hum, I just failed to reply correctly... I replied only to the original poster,
not to the whole list. My apologies.

Message:
At least to my mind, big file support deserves more than some tricky
manipulation with existing hook types. It would benefit greatly from being
integrated deeply within git, both in terms of optimisation and in terms of
integration.
I read the discussion about clean/smudge hooks too, but I set the idea aside
because the final verdict was "feels like a hack". See here:
http://article.gmane.org/gmane.comp.version-control.git/168857

In fact I don't know whether this should be considered "clean" or "hacky"…

--
Jonathan Michalon

* Re: Better big file support & GSoC
  2011-04-02 15:30 ` Carlos Martín Nieto
@ 2011-04-04 16:53   ` Jonathan Michalon
  0 siblings, 0 replies; 5+ messages in thread
From: Jonathan Michalon @ 2011-04-04 16:53 UTC (permalink / raw)
  Cc: git

On Sat, 2 Apr 2011 17:30:15 +0200,
Carlos Martín Nieto <cmn@elego.de> wrote:
> On Sat, Apr 02, 2011 at 04:40:51PM +0200, Jonathan Michalon wrote:
> > What I personally would like as a feature is the ability to store the main
> > repo with sources etc. into a conventional repo but put the data elsewhere
> > on a storage location. This would allow to develop programs which need data
> > to run (like textures in games etc.) without making the repo slow, big or
> > just messy.
> 
>  This sounds a lot like like what git-annex [0] does. Maybe
>  integrating its functionality with mainline git could be a good
>  start.
> 
> [0] http://git-annex.branchable.com/
> 
>    cmn

Edit:
Hum, I just failed to reply correctly... I replied only to the original poster,
not to the whole list. My apologies.

Message:
Yes, for sure. I will try to reuse as much code as possible, and digging into
code that almost does the job will help. But in fact I doubt that the two will
be very comparable, one being separate software and the other integrated code.
In addition, Eric Montellese already dug into the code but was not completely
satisfied. See: http://article.gmane.org/gmane.comp.version-control.git/165395

I would like to have the opinion of the community before going in the wrong
direction.

--
Jonathan Michalon
