* Problem with large files on different OSes
@ 2009-05-27 10:52 Christopher Jefferson
2009-05-27 11:37 ` Andreas Ericsson
` (2 more replies)
0 siblings, 3 replies; 31+ messages in thread
From: Christopher Jefferson @ 2009-05-27 10:52 UTC (permalink / raw)
To: git
I recently came across a very annoying problem, characterised by the
following example:
On a recent Ubuntu install:
dd if=/dev/zero of=file bs=1300k count=1k
git commit file -m "Add huge file"
The repository can be pulled and pushed successfully to other Ubuntu
installs, but on a Mac OS X 10.5.7 machine with 4GB of RAM, git pull
produces:
remote: Counting objects: 6, done.
remote: git(1533,0xb0081000) malloc: *** mmap(size=1363152896) failed (error code=12)
remote: *** error: can't allocate region
remote: *** set a breakpoint in malloc_error_break to debug
remote: git(1533,0xb0081000) malloc: *** mmap(size=1363152896) failed (error code=12)
remote: *** error: can't allocate region
remote: *** set a breakpoint in malloc_error_break to debug
remote: fatal: Out of memory, malloc failed
error: git upload-pack: git-pack-objects died with error.
fatal: git upload-pack: aborting due to possible repository corruption on the remote side.
remote: aborting due to possible repository corruption on the remote side.
fatal: protocol error: bad pack header
The problem appears to be the different maximum mmap sizes available
on different OSes. While I don't really mind the maximum file size
restriction git imposes, having this restriction vary from OS to OS is
very annoying. Fixing this required rewriting history to remove the
commit, which caused problems as the commit had already been pulled,
and built on, by a number of developers.
If the requirement that all files can be mmapped cannot be easily
removed, would it perhaps be acceptable to impose a (soft?) 1GB(ish)
file size limit? I suggest 1GB as all the OSes I can easily get hold
of (FreeBSD, Windows, Mac OS X, Linux) support an mmap of size >
1GB.
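For anyone hit by the same thing, one rough way to spot blobs above a
given size anywhere in history is a scan like the following (a sketch
only; the 1GB threshold and the walk over all refs are illustrative):

#!/bin/sh
# Report blobs in history larger than the (arbitrary) 1GB threshold.
limit=$((1024 * 1024 * 1024))
git rev-list --objects --all |
while read -r sha path; do
    test "$(git cat-file -t "$sha")" = blob || continue
    size=$(git cat-file -s "$sha")
    if [ "$size" -gt "$limit" ]; then
        echo "$size $sha $path"
    fi
done

Running something like this before publishing a branch makes it much
cheaper to catch an oversized commit while rewriting it is still
painless.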
^ permalink raw reply [flat|nested] 31+ messages in thread* Re: Problem with large files on different OSes 2009-05-27 10:52 Problem with large files on different OSes Christopher Jefferson @ 2009-05-27 11:37 ` Andreas Ericsson 2009-05-27 13:02 ` Christopher Jefferson 2009-05-27 13:28 ` John Tapsell 2009-05-27 14:01 ` Tomas Carnecky 2009-05-27 14:37 ` Jakub Narebski 2 siblings, 2 replies; 31+ messages in thread From: Andreas Ericsson @ 2009-05-27 11:37 UTC (permalink / raw) To: Christopher Jefferson; +Cc: git Christopher Jefferson wrote: > I recently came across a very annoying problem, characterised by the > following example: > > On a recent ubuntu install: > > dd if=/dev/zero of=file bs=1300k count=1k > git commit file -m "Add huge file" > > > The repository can be pulled and pushed successfully to other ubuntu > installs, but on Mac OS X, 10.5.7 machine with 4GB ram git pull produces: > > remote: Counting objects: 6, done. > remote: git(1533,0xb0081000) malloc: *** mmap(size=1363152896) failed > (error code=12) > remote: *** error: can't allocate region > remote: *** set a breakpoint in malloc_error_break to debug > remote: git(1533,0xb0081000) malloc: *** mmap(size=1363152896) failed > (error code=12) > remote: *** error: can't allocate region > remote: *** set a breakpoint in malloc_error_break to debug > remote: fatal: Out of memory, malloc failed > error: git upload-pack: git-pack-objects died with error. > fatal: git upload-pack: aborting due to possible repository corruption > on the remote side. > remote: aborting due to possible repository corruption on the remote side. > fatal: protocol error: bad pack header > > > The problem appears to be the different maximum mmap sizes available on > different OSes. Whic I don't really mind the maximum file size > restriction git imposes, this restriction varying from OS to OS is very > annoying, fixing this required rewriting history to remove the commit, > which caused problems as the commit had already been pulled, and built > on, by a number of developers. > > If the requirement that all files can be mmapped cannot be easily > removed, would be it perhaps be acceptable to impose a (soft?) 1GB(ish) > file size limit? Most definitely not. Why should we limit a cross-platform system for the benefit of one particular developer's lacking hardware? Such a convention should, if anything, be enforced by social policy, but not by the tool itself. Otherwise, why not just restrict the tool that created the huge file so that it makes smaller files that fit into git on all platforms instead? (No, that wasn't a real suggestion. It was just to make the point that your suggestion for git to impose artificial limits is equally ludicrous) -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 Register now for Nordic Meet on Nagios, June 3-4 in Stockholm http://nordicmeetonnagios.op5.org/ Considering the successes of the wars on alcohol, poverty, drugs and terror, I think we should give some serious thought to declaring war on peace. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-27 11:37 ` Andreas Ericsson @ 2009-05-27 13:02 ` Christopher Jefferson 2009-05-27 13:28 ` John Tapsell 1 sibling, 0 replies; 31+ messages in thread From: Christopher Jefferson @ 2009-05-27 13:02 UTC (permalink / raw) To: Andreas Ericsson; +Cc: git On 27 May 2009, at 12:37, Andreas Ericsson wrote: > Christopher Jefferson wrote: >> I recently came across a very annoying problem, characterised by >> the following example: >> On a recent ubuntu install: >> dd if=/dev/zero of=file bs=1300k count=1k >> git commit file -m "Add huge file" >> The repository can be pulled and pushed successfully to other >> ubuntu installs, but on Mac OS X, 10.5.7 machine with 4GB ram git >> pull produces: >> remote: Counting objects: 6, done. >> remote: git(1533,0xb0081000) malloc: *** mmap(size=1363152896) >> failed (error code=12) >> remote: *** error: can't allocate region >> remote: *** set a breakpoint in malloc_error_break to debug >> remote: git(1533,0xb0081000) malloc: *** mmap(size=1363152896) >> failed (error code=12) >> remote: *** error: can't allocate region >> remote: *** set a breakpoint in malloc_error_break to debug >> remote: fatal: Out of memory, malloc failed >> error: git upload-pack: git-pack-objects died with error. >> fatal: git upload-pack: aborting due to possible repository >> corruption on the remote side. >> remote: aborting due to possible repository corruption on the >> remote side. >> fatal: protocol error: bad pack header >> The problem appears to be the different maximum mmap sizes >> available on different OSes. Whic I don't really mind the maximum >> file size restriction git imposes, this restriction varying from OS >> to OS is very annoying, fixing this required rewriting history to >> remove the commit, which caused problems as the commit had already >> been pulled, and built on, by a number of developers. >> If the requirement that all files can be mmapped cannot be easily >> removed, would be it perhaps be acceptable to impose a (soft?) >> 1GB(ish) file size limit? > > Most definitely not. Why should we limit a cross-platform system for > the benefit of one particular developer's lacking hardware? Out of curiosity, why do you say lacking hardware? I am running ubuntu, windows and Mac OS X on exactly the same machine, which is not running out of physical memory, never mind swap, when using git on any OS. The problem is purely a software (and OS) problem. Chris ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-27 11:37 ` Andreas Ericsson 2009-05-27 13:02 ` Christopher Jefferson @ 2009-05-27 13:28 ` John Tapsell 2009-05-27 13:30 ` Christopher Jefferson 1 sibling, 1 reply; 31+ messages in thread From: John Tapsell @ 2009-05-27 13:28 UTC (permalink / raw) To: Andreas Ericsson; +Cc: Christopher Jefferson, git 2009/5/27 Andreas Ericsson <ae@op5.se>: > Christopher Jefferson wrote: >> If the requirement that all files can be mmapped cannot be easily removed, >> would be it perhaps be acceptable to impose a (soft?) 1GB(ish) file size >> limit? > > Most definitely not. Why should we limit a cross-platform system for > the benefit of one particular developer's lacking hardware? Perhaps a simple warning would suffice "Warning: Files larger than 2GB may cause problems when trying to checkout on Windows." John ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-27 13:28 ` John Tapsell @ 2009-05-27 13:30 ` Christopher Jefferson 2009-05-27 13:32 ` John Tapsell 0 siblings, 1 reply; 31+ messages in thread From: Christopher Jefferson @ 2009-05-27 13:30 UTC (permalink / raw) To: John Tapsell; +Cc: Andreas Ericsson, git On 27 May 2009, at 14:28, John Tapsell wrote: > 2009/5/27 Andreas Ericsson <ae@op5.se>: >> Christopher Jefferson wrote: >>> If the requirement that all files can be mmapped cannot be easily >>> removed, >>> would be it perhaps be acceptable to impose a (soft?) 1GB(ish) >>> file size >>> limit? >> >> Most definitely not. Why should we limit a cross-platform system for >> the benefit of one particular developer's lacking hardware? > > Perhaps a simple warning would suffice "Warning: Files larger than > 2GB may cause problems when trying to checkout on Windows." > Something like that, except that limit seems to be only 1.3GB on Mac OS X Chris ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-27 13:30 ` Christopher Jefferson @ 2009-05-27 13:32 ` John Tapsell 0 siblings, 0 replies; 31+ messages in thread From: John Tapsell @ 2009-05-27 13:32 UTC (permalink / raw) To: Christopher Jefferson; +Cc: Andreas Ericsson, git 2009/5/27 Christopher Jefferson <caj@cs.st-andrews.ac.uk>: > Something like that, except that limit seems to be only 1.3GB on Mac OS X Does linux have a similar limitation, lower than the limit imposed by the filesystem? Could this be solved by having a fallback solution for mmap? (switching to opening the file normally) Or would this fallback be too intrusive/large of a change? John ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-27 10:52 Problem with large files on different OSes Christopher Jefferson 2009-05-27 11:37 ` Andreas Ericsson @ 2009-05-27 14:01 ` Tomas Carnecky 2009-05-27 14:09 ` Christopher Jefferson 2009-05-27 14:37 ` Jakub Narebski 2 siblings, 1 reply; 31+ messages in thread From: Tomas Carnecky @ 2009-05-27 14:01 UTC (permalink / raw) To: Christopher Jefferson; +Cc: git On May 27, 2009, at 12:52 PM, Christopher Jefferson wrote: > I recently came across a very annoying problem, characterised by the > following example: > > On a recent ubuntu install: > > dd if=/dev/zero of=file bs=1300k count=1k > git commit file -m "Add huge file" > > > The repository can be pulled and pushed successfully to other ubuntu > installs, but on Mac OS X, 10.5.7 machine with 4GB ram git pull > produces: > > remote: Counting objects: 6, done. > remote: git(1533,0xb0081000) malloc: *** mmap(size=1363152896) > failed (error code=12) > remote: *** error: can't allocate region > remote: *** set a breakpoint in malloc_error_break to debug > remote: git(1533,0xb0081000) malloc: *** mmap(size=1363152896) > failed (error code=12) > remote: *** error: can't allocate region > remote: *** set a breakpoint in malloc_error_break to debug > remote: fatal: Out of memory, malloc failed > error: git upload-pack: git-pack-objects died with error. > fatal: git upload-pack: aborting due to possible repository > corruption on the remote side. > remote: aborting due to possible repository corruption on the remote > side. > fatal: protocol error: bad pack header > > > The problem appears to be the different maximum mmap sizes available > on different OSes. Whic I don't really mind the maximum file size > restriction git imposes, this restriction varying from OS to OS is > very annoying, fixing this required rewriting history to remove the > commit, which caused problems as the commit had already been pulled, > and built on, by a number of developers. > > If the requirement that all files can be mmapped cannot be easily > removed, would be it perhaps be acceptable to impose a (soft?) > 1GB(ish) file size limit? I suggest 1GB as all the OSes I can get > hold of easily (freeBSD, windows, Mac OS X, linux) support a mmap of > size > 1GB. I think this is a limitation of a 32bit build of git. I just tried with a 64bit build and it added the file just fine. The compiler on MacOSX (gcc) produces 32bit builds by default, even if the system supports 64bit executables. But gcc on 64bit Linux (at least the installations I have at home) produces a 64bit executables by default. Solaris/OpenSolaris behaves like MacOSX, no idea about *BSD or Windows. Maybe this is why git works on Linux but not MacOSX even on the same hardware. Btw, I built git with: make install prefix=... CC="gcc -m64", no modifications needed (MacOSX 10.5.7). tom ^ permalink raw reply [flat|nested] 31+ messages in thread
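For what it's worth, such a build can be produced and checked in place
with something like the following (the prefix is only an example):

make prefix=$HOME/git64 CC="gcc -m64" install
file $HOME/git64/bin/git    # should report a 64-bit (x86_64) executable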
* Re: Problem with large files on different OSes 2009-05-27 14:01 ` Tomas Carnecky @ 2009-05-27 14:09 ` Christopher Jefferson 2009-05-27 14:22 ` Andreas Ericsson 0 siblings, 1 reply; 31+ messages in thread From: Christopher Jefferson @ 2009-05-27 14:09 UTC (permalink / raw) To: Tomas Carnecky; +Cc: git On 27 May 2009, at 15:01, Tomas Carnecky wrote: >> >> The problem appears to be the different maximum mmap sizes >> available on different OSes. Whic I don't really mind the maximum >> file size restriction git imposes, this restriction varying from OS >> to OS is very annoying, fixing this required rewriting history to >> remove the commit, which caused problems as the commit had already >> been pulled, and built on, by a number of developers. >> >> If the requirement that all files can be mmapped cannot be easily >> removed, would be it perhaps be acceptable to impose a (soft?) >> 1GB(ish) file size limit? I suggest 1GB as all the OSes I can get >> hold of easily (freeBSD, windows, Mac OS X, linux) support a mmap >> of size > 1GB. > > I think this is a limitation of a 32bit build of git. I just tried > with a 64bit build and it added the file just fine. The compiler on > MacOSX (gcc) produces 32bit builds by default, even if the system > supports 64bit executables. But gcc on 64bit Linux (at least the > installations I have at home) produces a 64bit executables by > default. Solaris/OpenSolaris behaves like MacOSX, no idea about *BSD > or Windows. Maybe this is why git works on Linux but not MacOSX even > on the same hardware. > Btw, I built git with: make install prefix=... CC="gcc -m64", no > modifications needed (MacOSX 10.5.7). The git installs I am using are all 32bit, this machine doesn't have a 64bit processor (it is one of the few macs released without one). It's nice to know long term this problem will go away, that all suggests introducing some limit is not approriate, as while 32bit users have some arbitary limit above which they cannot go, I am sure all 64-bit OSes will manage to easily mmap any file. Of course warning such users they are producing packs that are not going to work on 32bit compiles of git isn't a stupid idea. Chris ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-27 14:09 ` Christopher Jefferson @ 2009-05-27 14:22 ` Andreas Ericsson 0 siblings, 0 replies; 31+ messages in thread From: Andreas Ericsson @ 2009-05-27 14:22 UTC (permalink / raw) To: Christopher Jefferson; +Cc: Tomas Carnecky, git Christopher Jefferson wrote: > > On 27 May 2009, at 15:01, Tomas Carnecky wrote: > >>> >>> The problem appears to be the different maximum mmap sizes available >>> on different OSes. Whic I don't really mind the maximum file size >>> restriction git imposes, this restriction varying from OS to OS is >>> very annoying, fixing this required rewriting history to remove the >>> commit, which caused problems as the commit had already been pulled, >>> and built on, by a number of developers. >>> >>> If the requirement that all files can be mmapped cannot be easily >>> removed, would be it perhaps be acceptable to impose a (soft?) >>> 1GB(ish) file size limit? I suggest 1GB as all the OSes I can get >>> hold of easily (freeBSD, windows, Mac OS X, linux) support a mmap of >>> size > 1GB. >> >> I think this is a limitation of a 32bit build of git. I just tried >> with a 64bit build and it added the file just fine. The compiler on >> MacOSX (gcc) produces 32bit builds by default, even if the system >> supports 64bit executables. But gcc on 64bit Linux (at least the >> installations I have at home) produces a 64bit executables by default. >> Solaris/OpenSolaris behaves like MacOSX, no idea about *BSD or >> Windows. Maybe this is why git works on Linux but not MacOSX even on >> the same hardware. >> Btw, I built git with: make install prefix=... CC="gcc -m64", no >> modifications needed (MacOSX 10.5.7). > > The git installs I am using are all 32bit, this machine doesn't have a > 64bit processor (it is one of the few macs released without one). It's > nice to know long term this problem will go away, that all suggests > introducing some limit is not approriate, as while 32bit users have some > arbitary limit above which they cannot go, I am sure all 64-bit OSes > will manage to easily mmap any file. Of course warning such users they > are producing packs that are not going to work on 32bit compiles of git > isn't a stupid idea. > mmap()'ing large files (> 4GB) work just fine on Linux. You can't mmap() more than 4GB at a time though (I think; I didn't try), but since we don't do that anyway I doubt that was the problem. The file you produced with your dd command should have ended up being 1239MB, or 1.21GB, so the real hard limit for MacOSX seem to be 1GB if, indeed, there is one. On the other hand, the error message you got ("fatal: Out of memory, malloc failed") seems to indicate the system actually had no memory left when you tried to garbage-collect your repository. Are you using a dual-core system? If so, please try again with pack.threads = 1 set in the .git/config file of that particular repository. Each thread will allocate roughly the same amount of memory, so if both of them had to handle that huge blob at the same time, they'd have exploded memory usage up to 1.3GB + the compressed size of them + DAG-bookkeeping etc etc. I'm guessing we'd have seen error reports from other OSX users if it was actually impossible to mmap() 1GB files in git on OSX. 
-- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 Register now for Nordic Meet on Nagios, June 3-4 in Stockholm http://nordicmeetonnagios.op5.org/ Considering the successes of the wars on alcohol, poverty, drugs and terror, I think we should give some serious thought to declaring war on peace. ^ permalink raw reply [flat|nested] 31+ messages in thread
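The single-thread test suggested above is one command per repository
(it simply writes "[pack] threads = 1" into that repository's
.git/config):

git config pack.threads 1

Setting it back to 0 afterwards restores the default of auto-detecting
the number of CPUs.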
* Re: Problem with large files on different OSes 2009-05-27 10:52 Problem with large files on different OSes Christopher Jefferson 2009-05-27 11:37 ` Andreas Ericsson 2009-05-27 14:01 ` Tomas Carnecky @ 2009-05-27 14:37 ` Jakub Narebski 2009-05-27 16:30 ` Linus Torvalds 2 siblings, 1 reply; 31+ messages in thread From: Jakub Narebski @ 2009-05-27 14:37 UTC (permalink / raw) To: Christopher Jefferson; +Cc: git Christopher Jefferson <caj@cs.st-andrews.ac.uk> writes: > I recently came across a very annoying problem, characterised by the > following example: > > On a recent ubuntu install: > > dd if=/dev/zero of=file bs=1300k count=1k > git commit file -m "Add huge file" > > > The repository can be pulled and pushed successfully to other ubuntu > installs, but on Mac OS X, 10.5.7 machine with 4GB ram git pull > produces: Do seting `pack.packSizeLimit`, or adjusting values of `core.packedGitWindowSize` and/or `core.packedGitLimit` (see git-config(1)) help in your situation? -- Jakub Narebski Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 31+ messages in thread
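Concretely, those knobs can be tried with something like the following
(the values are only illustrative starting points; see git-config(1)
for their exact meaning):

git config pack.packSizeLimit 512m
git config core.packedGitWindowSize 32m
git config core.packedGitLimit 256m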
* Re: Problem with large files on different OSes 2009-05-27 14:37 ` Jakub Narebski @ 2009-05-27 16:30 ` Linus Torvalds 2009-05-27 16:59 ` Linus Torvalds 0 siblings, 1 reply; 31+ messages in thread From: Linus Torvalds @ 2009-05-27 16:30 UTC (permalink / raw) To: Jakub Narebski; +Cc: Christopher Jefferson, git On Wed, 27 May 2009, Jakub Narebski wrote: > > Do seting `pack.packSizeLimit`, or adjusting values of > `core.packedGitWindowSize` and/or `core.packedGitLimit` > (see git-config(1)) help in your situation? No, that will help just the packfile mmap (and even there, it won't help with things like index file size - we'll always mmap the whole index file). It's definitely worth doing, though - but I think we already default to 32MB pack-file windows on 32-bit architectures. Individual files we always handle in one go. It's what git was designed for, after all - fairly small files. And so git is limited to files smaller than the virtual address space. On a 32-bit setup, that often limits you to roughly a gigabyte. You have 4GB of virtual address space, of which one or two is used for the OS kernel. So say you have 2GB for user mode - you then have the executable mapping and libraries and stack, all spread out in that 2GB virtual address space. In fact, even if it's 3GB for user (I don't know what OS X does), getting one contiguous area may well be limited to ~1GB depending on layout of shared library mappings etc VM fragmentation. Older Linux systems tended to map things in ways that made it hard to get more than 1GB of contiguous data mapping if you compiled with dynamic libraries. 64-bit mostly makes this a non-issue. In fact, if you do "diff", you're going to be even _more_ limited, since for simplicity, our version of xdiff really wants both sources in memory at a time. So you can't really diff >500MB files. Not that you generally want to, of course. I'll see if I can make us handle the "big file without diff" case better by chunking. Linus ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-27 16:30 ` Linus Torvalds @ 2009-05-27 16:59 ` Linus Torvalds 2009-05-27 17:22 ` Christopher Jefferson 2009-05-27 17:37 ` Nicolas Pitre 0 siblings, 2 replies; 31+ messages in thread From: Linus Torvalds @ 2009-05-27 16:59 UTC (permalink / raw) To: Jakub Narebski; +Cc: Christopher Jefferson, git On Wed, 27 May 2009, Linus Torvalds wrote: > > I'll see if I can make us handle the "big file without diff" case better > by chunking. Hmm. No. Looking at it some more, we could add some nasty code to do _some_ things chunked (like adding a new file as a single object), but it doesn't really help. For any kind of useful thing, we'd need to handle the "read from pack" case in multiple chunks too, and that gets really nasty really quickly. The whole "each object as one allocation" design is pretty core, and it looks pointless to have a few special cases, when any actual relevant use would need a whole lot more than the few simple ones. Git really doesn't like big individual objects. I've occasionally thought about handling big files as multiple big objects: we'd split them into a "pseudo-directory" (it would have some new object ID), and then treat them as a magical special kind of directory that just happens to be represented as one large file on the filesystem. That would mean that if you have a huge file, git internally would never think of it as one big file, but as a collection of many smaller objects. By just making the point where you break up files be a consistent rule ("always break into 256MB pieces"), it would be a well-behaved design (ie things like behaviour convergence wrt the same big file being created different ways). HOWEVER. While that would fit in the git design (ie it would be just a fairly straightforward extension - another level of indirection, kind of the way we added subprojects), it would still be a rewrite of some core stuff. The actual number of lines might not be too horrid, but quite frankly, I wouldn't want to do it personally. It would be a lot of work with lots of careful special case handling - and no real upside for normal use. So I'm kind of down on it. I would suggest just admitting that git isn't very good at big individual files - especially not if you have a limited address space. So "don't do it then" or "make sure you are 64-bit and have lots of memory if you do it" may well be the right solution. [ And it's really really sad how Apple migrated to x86-32. It was totally unforgivably stupid, and I said so at the time. When Apple did the PowerPC -> x86 transition, they should have just transitioned to x86-64, and never had a 32-bit space. But Apple does stupid things, that seem to be driven by marketing rather than thinking deeply about the technology, and now they basically _have_ to default to that 32-bit environment. ] Oh well. Linus ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-27 16:59 ` Linus Torvalds @ 2009-05-27 17:22 ` Christopher Jefferson 2009-05-27 17:30 ` Jakub Narebski 2009-05-27 17:37 ` Nicolas Pitre 1 sibling, 1 reply; 31+ messages in thread From: Christopher Jefferson @ 2009-05-27 17:22 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jakub Narebski, git On 27 May 2009, at 17:59, Linus Torvalds wrote: > > > On Wed, 27 May 2009, Linus Torvalds wrote: >> >> I'll see if I can make us handle the "big file without diff" case >> better >> by chunking. > So "don't do it then" or "make sure you are 64-bit and have lots of > memory if you do it" may well be the right solution. Thank you for that description of the problem, I can see how hard it is. Perhaps it might be useful to think about how to codify "don't do it then" in a reasonably simple, automatic way? I've been trying to write a pre-commit hook (I think that's the right place?) which would refuse commits larger than some file size (512MB as a random number I decided), but am having trouble getting it to work right, and generally. Would such a thing be easy, and would that be the right place to put it? While I wouldn't suggest this become default, providing such a hook, and describing why you might want to use it, would seem to avoid the accidental part of the problem. Of course, people should really notice that they are submitting large files, but it's easy(ish) to commit some output file from a program, without realising the file ended up being the wrong side of 1GB. Chris ^ permalink raw reply [flat|nested] 31+ messages in thread
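A minimal sketch of such a hook, saved as .git/hooks/pre-commit and
made executable, might look like this (the 512MB threshold is the
arbitrary figure from the message above, and paths containing
whitespace are not handled):

#!/bin/sh
# Refuse the commit if any staged blob is larger than the limit.
limit=$((512 * 1024 * 1024))
fail=0
for path in $(git diff --cached --name-only --diff-filter=AM); do
    # ":$path" names the blob staged in the index for that path.
    size=$(git cat-file -s ":$path" 2>/dev/null) || continue
    if [ "$size" -gt "$limit" ]; then
        echo >&2 "pre-commit: '$path' is $size bytes, over the $limit byte limit"
        fail=1
    fi
done
exit $fail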
* Re: Problem with large files on different OSes 2009-05-27 17:22 ` Christopher Jefferson @ 2009-05-27 17:30 ` Jakub Narebski 0 siblings, 0 replies; 31+ messages in thread From: Jakub Narebski @ 2009-05-27 17:30 UTC (permalink / raw) To: Christopher Jefferson; +Cc: Linus Torvalds, git On Wed, 27 May 2009, Christopher Jefferson wrote: > On 27 May 2009, at 17:59, Linus Torvalds wrote: >> On Wed, 27 May 2009, Linus Torvalds wrote: >>> >>> I'll see if I can make us handle the "big file without diff" case >>> better by chunking. >> So "don't do it then" or "make sure you are 64-bit and have lots of >> memory if you do it" may well be the right solution. > > Thank you for that description of the problem, I can see how hard it is. > > Perhaps it might be useful to think about how to codify "don't do it > then" in a reasonably simple, automatic way? > > I've been trying to write a pre-commit hook (I think that's the right > place?) which would refuse commits larger than some file size (512MB > as a random number I decided), but am having trouble getting it to > work right, and generally. Would such a thing be easy, and would that > be the right place to put it? > > While I wouldn't suggest this become default, providing such a hook, > and describing why you might want to use it, would seem to avoid the > accidental part of the problem. Hmmm... this is another issue (beside checking for portability of filenames) that would be neatly solved if there was 'pre-add' hook, rather than trying to use 'pre-commit' hook for that. It should not, I think, be that hard to add it... -- Jakub Narebski Poland ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-27 16:59 ` Linus Torvalds 2009-05-27 17:22 ` Christopher Jefferson @ 2009-05-27 17:37 ` Nicolas Pitre 2009-05-27 21:53 ` Jeff King 1 sibling, 1 reply; 31+ messages in thread From: Nicolas Pitre @ 2009-05-27 17:37 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jakub Narebski, Christopher Jefferson, git On Wed, 27 May 2009, Linus Torvalds wrote: > Hmm. No. Looking at it some more, we could add some nasty code to do > _some_ things chunked (like adding a new file as a single object), but it > doesn't really help. For any kind of useful thing, we'd need to handle the > "read from pack" case in multiple chunks too, and that gets really nasty > really quickly. > > The whole "each object as one allocation" design is pretty core, and it > looks pointless to have a few special cases, when any actual relevant use > would need a whole lot more than the few simple ones. > > Git really doesn't like big individual objects. > > I've occasionally thought about handling big files as multiple big > objects: we'd split them into a "pseudo-directory" (it would have some new > object ID), and then treat them as a magical special kind of directory > that just happens to be represented as one large file on the filesystem. > > That would mean that if you have a huge file, git internally would never > think of it as one big file, but as a collection of many smaller objects. > By just making the point where you break up files be a consistent rule > ("always break into 256MB pieces"), it would be a well-behaved design (ie > things like behaviour convergence wrt the same big file being created > different ways). > > HOWEVER. > > While that would fit in the git design (ie it would be just a fairly > straightforward extension - another level of indirection, kind of the way > we added subprojects), it would still be a rewrite of some core stuff. The > actual number of lines might not be too horrid, but quite frankly, I > wouldn't want to do it personally. It would be a lot of work with lots of > careful special case handling - and no real upside for normal use. My idea for handling big files is simply to: 1) Define a new parameter to determine what is considered a big file. 2) Store any file larger than the treshold defined in (1) directly into a pack of their own at "git add" time. 3) Never attempt to diff nor delta large objects, again according to (1) above. It is typical for large files not to be deltifiable, and a diff for files in the thousands of megabytes cannot possibly be sane. The idea is to avoid ever needing to load such object's content entirely in memory. So with the data already in a pack, the pack data reuse logic (which already does its copy in chunks) could be triggered during a repack/fetch/push. This is also quite trivial to implement with very few special cases, and then git would handle huge repositories with lots of huge files just as well as any other SCMs. The usual git repository compactness won't be there of course, but I doubt people dealing with repositories in the hundreds of gigabytes really care. Nicolas ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-27 17:37 ` Nicolas Pitre @ 2009-05-27 21:53 ` Jeff King 2009-05-27 22:07 ` Linus Torvalds 2009-05-27 23:29 ` Nicolas Pitre 0 siblings, 2 replies; 31+ messages in thread From: Jeff King @ 2009-05-27 21:53 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Linus Torvalds, Jakub Narebski, Christopher Jefferson, git On Wed, May 27, 2009 at 01:37:26PM -0400, Nicolas Pitre wrote: > My idea for handling big files is simply to: > > 1) Define a new parameter to determine what is considered a big file. > > 2) Store any file larger than the treshold defined in (1) directly into > a pack of their own at "git add" time. > > 3) Never attempt to diff nor delta large objects, again according to > (1) above. It is typical for large files not to be deltifiable, and > a diff for files in the thousands of megabytes cannot possibly be > sane. What about large files that have a short metadata section that may change? Versions with only the metadata changed delta well, and with a custom diff driver, can produce useful diffs. And I don't think that is an impractical or unlikely example; large files can often be tagged media. Linus' "split into multiple objects" approach means you could perhaps split intelligently into metadata and "uninteresting data" sections based on the file type. That would make things like rename detection very fast. Of course it has the downside that you are cementing whatever split you made into history for all time. And it means that two people adding the same content might end up with different trees. Both things that git tries to avoid. I wonder if it would be useful to make such a split at _read_ time. That is, still refer to the sha-1 of the whole content in the tree objects, but have a separate cache that says "hash X splits to the concatenation of Y,Z". Thus you can always refer to the "pure" object, both as a user, and in the code. So we could avoid retrofitting all of the code -- just some parts like diff might want to handle an object in multiple segments. -Peff ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-27 21:53 ` Jeff King @ 2009-05-27 22:07 ` Linus Torvalds 2009-05-27 23:09 ` Alan Manuel Gloria 2009-05-28 19:43 ` Jeff King 2009-05-27 23:29 ` Nicolas Pitre 1 sibling, 2 replies; 31+ messages in thread From: Linus Torvalds @ 2009-05-27 22:07 UTC (permalink / raw) To: Jeff King; +Cc: Nicolas Pitre, Jakub Narebski, Christopher Jefferson, git On Wed, 27 May 2009, Jeff King wrote: > > Linus' "split into multiple objects" approach means you could perhaps > split intelligently into metadata and "uninteresting data" sections > based on the file type. I suspect you wouldn't even need to. A regular delta algorithm would just work fairly well to find the common parts. Sure, if the offset of the data changes a lot, then you'd miss all the deltas between two (large) objects that now have data that traverses object boundaries, but especially if the split size is pretty large (ie several tens of MB, possibly something like 256M), that's still going to be a pretty rare event. IOW, imagine that you have a big file that is 2GB in size, and you prepend 100kB of data to it (that's why it's so big - you keep prepending data to it as some kind of odd ChangeLog file). What happens? It would still delta fairly well, even if the delta's would now be: - 100kB of new data - 256M - 100kB of old data as a small delta entry and the _next_ chunk woul be: - 100kB of "new" data (old data from the previous chunk) - 256M - 100kB of old data as a small delta entry .. and so on for each chunk. So if the whole file is 2GB, it would be roughly 8 256MB chunks, and it would delta perfectly well: except for the overlap, that would now be 8x 100kB "slop" deltas. So even a totally unmodified delta algorithm would shrink down the two copies of a ~2GB file to one copy + 900kB of extra delta. Sure, a perfect xdelta thing that would have treated it as one huge file would have had just 100kB of delta data, but 900kB would still be a *big* saving over duplicating the whole 2GB. > That would make things like rename detection very fast. Of course it has > the downside that you are cementing whatever split you made into history > for all time. And it means that two people adding the same content might > end up with different trees. Both things that git tries to avoid. It's the "I can no longer see that the files are the same by comparing SHA1's" that I personally dislike. So my "fixed chunk" approach would be nice in that if you have this kind of "chunkblob" entry, in the tree (and index) it would literally be one entry, and look like that: 100644 chunkblob <sha1> so you could compare two trees that have the same chunkblob entry, and just see that they are the same without ever looking at the (humongous) data. The <chunkblob> type itself would then look like just an array of SHA1's, ie it would literally be an object that only points to other blobs. Kind of a "simplified tree object", if you will. I think it would fit very well in the git model. But it's a nontrivial amount of changes. Linus ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-27 22:07 ` Linus Torvalds @ 2009-05-27 23:09 ` Alan Manuel Gloria 2009-05-28 1:56 ` Linus Torvalds 2009-05-28 19:43 ` Jeff King 1 sibling, 1 reply; 31+ messages in thread From: Alan Manuel Gloria @ 2009-05-27 23:09 UTC (permalink / raw) To: Linus Torvalds Cc: Jeff King, Nicolas Pitre, Jakub Narebski, Christopher Jefferson, git On Thu, May 28, 2009 at 6:07 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Wed, 27 May 2009, Jeff King wrote: >> >> Linus' "split into multiple objects" approach means you could perhaps >> split intelligently into metadata and "uninteresting data" sections >> based on the file type. > > I suspect you wouldn't even need to. A regular delta algorithm would just > work fairly well to find the common parts. > > Sure, if the offset of the data changes a lot, then you'd miss all the > deltas between two (large) objects that now have data that traverses > object boundaries, but especially if the split size is pretty large (ie > several tens of MB, possibly something like 256M), that's still going to > be a pretty rare event. > > IOW, imagine that you have a big file that is 2GB in size, and you prepend > 100kB of data to it (that's why it's so big - you keep prepending data to > it as some kind of odd ChangeLog file). What happens? It would still delta > fairly well, even if the delta's would now be: > > - 100kB of new data > - 256M - 100kB of old data as a small delta entry > > and the _next_ chunk woul be: > > - 100kB of "new" data (old data from the previous chunk) > - 256M - 100kB of old data as a small delta entry > > .. and so on for each chunk. So if the whole file is 2GB, it would be > roughly 8 256MB chunks, and it would delta perfectly well: except for the > overlap, that would now be 8x 100kB "slop" deltas. > > So even a totally unmodified delta algorithm would shrink down the two > copies of a ~2GB file to one copy + 900kB of extra delta. > > Sure, a perfect xdelta thing that would have treated it as one huge file > would have had just 100kB of delta data, but 900kB would still be a *big* > saving over duplicating the whole 2GB. > >> That would make things like rename detection very fast. Of course it has >> the downside that you are cementing whatever split you made into history >> for all time. And it means that two people adding the same content might >> end up with different trees. Both things that git tries to avoid. > > It's the "I can no longer see that the files are the same by comparing > SHA1's" that I personally dislike. > > So my "fixed chunk" approach would be nice in that if you have this kind > of "chunkblob" entry, in the tree (and index) it would literally be one > entry, and look like that: > > 100644 chunkblob <sha1> > > so you could compare two trees that have the same chunkblob entry, and > just see that they are the same without ever looking at the (humongous) > data. > > The <chunkblob> type itself would then look like just an array of SHA1's, > ie it would literally be an object that only points to other blobs. Kind > of a "simplified tree object", if you will. > > I think it would fit very well in the git model. But it's a nontrivial > amount of changes. > > Linus I'd like to pitch in that our mother company uses Subversion, and they consistently push very large binaries onto their Subvesion repositories (I know it's not a good idea. They do it nevertheless. 
The very large binary is a description of a design in a proprietary format by a proprietary tool; they don't want to keep running that tool because of licensing etc issues, so they archive it on Subversion). I'm trying to convince the mother company to switch to git, mostly because our company (the daughter company) doesn't have direct access to their Subversion repo (we're in another country), and I've become convinced that distributed repos like git are the way to go. But the fact that large binaries require me to turn off gc.auto and otherwise avoid packing large filles makes my case a harder sell; quite a bit of the mother company's workflow has been integrated with Subversion. Note that in my case "large binary" is really a 164Mb file, but my work system is a dual-core 512Mb computer, so I suppose my hardware is really the limitation; still, some of the computers at the mother company are even lousier. If you'd prefer someone else to hack it, can you at least give me some pointers on which code files to start looking? I'd really like to have proper large-file-packing support, where large file is anything much bigger than a megabyte or so. Admittedly I'm not a filesystems guy and I can just barely grok git's blobs (they're the actual files, right? except they're named with their hash), but not packs (err, a bunch of files?) and trees (brown and green stuff you plant?). Still, I can try to learn it. Sincerely, AmkG ^ permalink raw reply [flat|nested] 31+ messages in thread
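The auto-gc workaround mentioned above is a single setting per
repository (0 disables automatic repacking entirely):

git config gc.auto 0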
* Re: Problem with large files on different OSes 2009-05-27 23:09 ` Alan Manuel Gloria @ 2009-05-28 1:56 ` Linus Torvalds 2009-05-28 3:26 ` Nicolas Pitre 0 siblings, 1 reply; 31+ messages in thread From: Linus Torvalds @ 2009-05-28 1:56 UTC (permalink / raw) To: Alan Manuel Gloria Cc: Jeff King, Nicolas Pitre, Jakub Narebski, Christopher Jefferson, git On Thu, 28 May 2009, Alan Manuel Gloria wrote: > > If you'd prefer someone else to hack it, can you at least give me some > pointers on which code files to start looking? I'd really like to > have proper large-file-packing support, where large file is anything > much bigger than a megabyte or so. > > Admittedly I'm not a filesystems guy and I can just barely grok git's > blobs (they're the actual files, right? except they're named with > their hash), but not packs (err, a bunch of files?) and trees (brown > and green stuff you plant?). Still, I can try to learn it. The packs is a big part of the complexity. If you were to keep the big files as unpacked blobs, that would be fairly simple - but the pack-file format is needed for fetching and pushing things, so it's not really an option. For your particular case, the simplest approach is probably to just limit the delta search. Something like just saying "if the object is larger than X, don't even bother to try to delta it, and just pack it without delta compression". The code would still load that whole object in one go, but it sounds like you can handle _one_ object at a time. So for your case, I don't think you need a fundamental git change - you'd be ok with just an inefficient pack format for large files that are very expensive to pack otherwise. You can already do that by using .gitattributes to not delta entries by name, but maybe it's worth doing explicitly by size too. I realize that the "delta" attribute is apparently almost totally undocumented. But if your big blobs have a particular name pattern, what you should try is to do something like - in your '.gitattributes' file (or .git/info/attributes if you don't want to check it in), add a line like *.img !delta which now sets the 'delta' attribute to false for all objects that match the '*.img' pattern. - see if pack creation is now acceptable (ie do a "git gc" or try to push somewhere) Something like the following may also work, as a more generic "just don't even bother trying to delta huge files". Totally untested. Maybe it works. Maybe it doesn't. Linus --- Documentation/config.txt | 7 +++++++ builtin-pack-objects.c | 9 +++++++++ 2 files changed, 16 insertions(+), 0 deletions(-) diff --git a/Documentation/config.txt b/Documentation/config.txt index 2c03162..8c21027 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -1238,6 +1238,13 @@ older version of git. If the `{asterisk}.pack` file is smaller than 2 GB, howeve you can use linkgit:git-index-pack[1] on the *.pack file to regenerate the `{asterisk}.idx` file. +pack.packDeltaLimit:: + The default maximum size of objects that we try to delta. ++ +Big files can be very expensive to delta, and if they are large binary +blobs, there is likely little upside to it anyway. So just pack them +as-is, and don't waste time on them. + pack.packSizeLimit:: The default maximum size of a pack. This setting only affects packing to a file, i.e. the git:// protocol is unaffected. 
It diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c index 9742b45..9a0072b 100644 --- a/builtin-pack-objects.c +++ b/builtin-pack-objects.c @@ -85,6 +85,7 @@ static struct progress *progress_state; static int pack_compression_level = Z_DEFAULT_COMPRESSION; static int pack_compression_seen; +static unsigned long pack_delta_limit = 64*1024*1024; static unsigned long delta_cache_size = 0; static unsigned long max_delta_cache_size = 0; static unsigned long cache_max_small_delta_size = 1000; @@ -1270,6 +1271,10 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, if (trg_entry->type != src_entry->type) return -1; + /* If we limit delta generation, don't even bother for larger blobs */ + if (pack_delta_limit && trg_entry->size >= pack_delta_limit) + return -1; + /* * We do not bother to try a delta that we discarded * on an earlier try, but only when reusing delta data. @@ -1865,6 +1870,10 @@ static int git_pack_config(const char *k, const char *v, void *cb) pack_size_limit_cfg = git_config_ulong(k, v); return 0; } + if (!strcmp(k, "pack.packdeltalimit")) { + pack_delta_limit = git_config_ulong(k, v); + return 0; + } return git_default_config(k, v, cb); } ^ permalink raw reply related [flat|nested] 31+ messages in thread
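As a paste-able form of the attributes route described above (note
that gitattributes spells "set this attribute to false" with a leading
'-', and the '*.img' pattern is only an example):

echo '*.img -delta' >> .gitattributes    # or >> .git/info/attributes to keep it local
git gc                                   # repack; matching blobs skip the delta search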
* Re: Problem with large files on different OSes 2009-05-28 1:56 ` Linus Torvalds @ 2009-05-28 3:26 ` Nicolas Pitre 2009-05-28 4:21 ` Eric Raible 0 siblings, 1 reply; 31+ messages in thread From: Nicolas Pitre @ 2009-05-28 3:26 UTC (permalink / raw) To: Linus Torvalds Cc: Alan Manuel Gloria, Jeff King, Jakub Narebski, Christopher Jefferson, git On Wed, 27 May 2009, Linus Torvalds wrote: > Something like the following may also work, as a more generic "just don't > even bother trying to delta huge files". > > Totally untested. Maybe it works. Maybe it doesn't. > > Linus > > --- > Documentation/config.txt | 7 +++++++ > builtin-pack-objects.c | 9 +++++++++ > 2 files changed, 16 insertions(+), 0 deletions(-) > > diff --git a/Documentation/config.txt b/Documentation/config.txt > index 2c03162..8c21027 100644 > --- a/Documentation/config.txt > +++ b/Documentation/config.txt > @@ -1238,6 +1238,13 @@ older version of git. If the `{asterisk}.pack` file is smaller than 2 GB, howeve > you can use linkgit:git-index-pack[1] on the *.pack file to regenerate > the `{asterisk}.idx` file. > > +pack.packDeltaLimit:: > + The default maximum size of objects that we try to delta. The option name feels a bit wrong here, like if it meant the max number of deltas in a pack. Nothing better comes to my mind at the moment though. > diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c > index 9742b45..9a0072b 100644 > --- a/builtin-pack-objects.c > +++ b/builtin-pack-objects.c > @@ -85,6 +85,7 @@ static struct progress *progress_state; > static int pack_compression_level = Z_DEFAULT_COMPRESSION; > static int pack_compression_seen; > > +static unsigned long pack_delta_limit = 64*1024*1024; > static unsigned long delta_cache_size = 0; > static unsigned long max_delta_cache_size = 0; > static unsigned long cache_max_small_delta_size = 1000; > @@ -1270,6 +1271,10 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, > if (trg_entry->type != src_entry->type) > return -1; > > + /* If we limit delta generation, don't even bother for larger blobs */ > + if (pack_delta_limit && trg_entry->size >= pack_delta_limit) > + return -1; I'd suggest filtering delta candidates out of delta_list up front in prepare_pack() instead. Nicolas ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-28 3:26 ` Nicolas Pitre @ 2009-05-28 4:21 ` Eric Raible 2009-05-28 4:30 ` Shawn O. Pearce 2009-05-28 17:41 ` Nicolas Pitre 0 siblings, 2 replies; 31+ messages in thread From: Eric Raible @ 2009-05-28 4:21 UTC (permalink / raw) To: git Nicolas Pitre <nico <at> cam.org> writes: > On Wed, 27 May 2009, Linus Torvalds wrote: > > > +pack.packDeltaLimit:: > > + The default maximum size of objects that we try to delta. > > The option name feels a bit wrong here, like if it meant the max number > of deltas in a pack. Nothing better comes to my mind at the moment > though. pack.maxDeltaSize sounds weird when said aloud. How about pack.deltaMaxSize? - Eric ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-28 4:21 ` Eric Raible @ 2009-05-28 4:30 ` Shawn O. Pearce 2009-05-28 5:52 ` Eric Raible 2009-05-28 17:41 ` Nicolas Pitre 1 sibling, 1 reply; 31+ messages in thread From: Shawn O. Pearce @ 2009-05-28 4:30 UTC (permalink / raw) To: Eric Raible; +Cc: git Eric Raible <raible@gmail.com> wrote: > Nicolas Pitre <nico <at> cam.org> writes: > > On Wed, 27 May 2009, Linus Torvalds wrote: > > > > > +pack.packDeltaLimit:: > > > + The default maximum size of objects that we try to delta. > > > > The option name feels a bit wrong here, like if it meant the max number > > of deltas in a pack. Nothing better comes to my mind at the moment > > though. > > pack.maxDeltaSize sounds weird when said aloud. > How about pack.deltaMaxSize? That sounds like, how big should a delta be? E.g. set it to 200 and any delta instruction stream over 200 bytes would be discarded, causing the whole object to be stored instead. Which is obviously somewhat silly, but that's the way I'd read that option... -- Shawn. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-28 4:30 ` Shawn O. Pearce @ 2009-05-28 5:52 ` Eric Raible 2009-05-28 8:52 ` Andreas Ericsson 0 siblings, 1 reply; 31+ messages in thread From: Eric Raible @ 2009-05-28 5:52 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: git On Wed, May 27, 2009 at 9:30 PM, Shawn O. Pearce <spearce@spearce.org> wrote: > Eric Raible <raible@gmail.com> wrote: >> Nicolas Pitre <nico <at> cam.org> writes: >> > On Wed, 27 May 2009, Linus Torvalds wrote: >> > >> > > +pack.packDeltaLimit:: >> > > + The default maximum size of objects that we try to delta. >> > >> > The option name feels a bit wrong here, like if it meant the max number >> > of deltas in a pack. Nothing better comes to my mind at the moment >> > though. >> >> pack.maxDeltaSize sounds weird when said aloud. >> How about pack.deltaMaxSize? > > That sounds like, how big should a delta be? E.g. set it to 200 > and any delta instruction stream over 200 bytes would be discarded, > causing the whole object to be stored instead. Which is obviously > somewhat silly, but that's the way I'd read that option... > > -- > Shawn. You're right, that _is_ a strange color for the bike shed... ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-28 5:52 ` Eric Raible @ 2009-05-28 8:52 ` Andreas Ericsson 0 siblings, 0 replies; 31+ messages in thread From: Andreas Ericsson @ 2009-05-28 8:52 UTC (permalink / raw) To: Eric Raible; +Cc: Shawn O. Pearce, git Eric Raible wrote: > On Wed, May 27, 2009 at 9:30 PM, Shawn O. Pearce <spearce@spearce.org> wrote: >> Eric Raible <raible@gmail.com> wrote: >>> Nicolas Pitre <nico <at> cam.org> writes: >>>> On Wed, 27 May 2009, Linus Torvalds wrote: >>>> >>>>> +pack.packDeltaLimit:: >>>>> + The default maximum size of objects that we try to delta. >>>> The option name feels a bit wrong here, like if it meant the max number >>>> of deltas in a pack. Nothing better comes to my mind at the moment >>>> though. >>> pack.maxDeltaSize sounds weird when said aloud. >>> How about pack.deltaMaxSize? >> That sounds like, how big should a delta be? E.g. set it to 200 >> and any delta instruction stream over 200 bytes would be discarded, >> causing the whole object to be stored instead. Which is obviously >> somewhat silly, but that's the way I'd read that option... >> >> -- >> Shawn. > > You're right, that _is_ a strange color for the bike shed... Since 'delta' names both the action and the result of the action, it's tricky to get it unambiguous without helping the grammar along a little. pack.maxFileSizeToDelta is probably the shortest we're going to get it while avoiding ambiguity. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 Register now for Nordic Meet on Nagios, June 3-4 in Stockholm http://nordicmeetonnagios.op5.org/ Considering the successes of the wars on alcohol, poverty, drugs and terror, I think we should give some serious thought to declaring war on peace. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-28 4:21 ` Eric Raible 2009-05-28 4:30 ` Shawn O. Pearce @ 2009-05-28 17:41 ` Nicolas Pitre 1 sibling, 0 replies; 31+ messages in thread From: Nicolas Pitre @ 2009-05-28 17:41 UTC (permalink / raw) To: Eric Raible; +Cc: git [ please don't drop original sender address and CC unless asked to ] On Thu, 28 May 2009, Eric Raible wrote: > Nicolas Pitre <nico <at> cam.org> writes: > > > On Wed, 27 May 2009, Linus Torvalds wrote: > > > > > +pack.packDeltaLimit:: > > > + The default maximum size of objects that we try to delta. > > > > The option name feels a bit wrong here, like if it meant the max number > > of deltas in a pack. Nothing better comes to my mind at the moment > > though. > > pack.maxDeltaSize sounds weird when said aloud. > How about pack.deltaMaxSize? pack.MaxSizeForDelta Whatever... Nicolas ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-27 22:07 ` Linus Torvalds 2009-05-27 23:09 ` Alan Manuel Gloria @ 2009-05-28 19:43 ` Jeff King 2009-05-28 19:49 ` Linus Torvalds 1 sibling, 1 reply; 31+ messages in thread From: Jeff King @ 2009-05-28 19:43 UTC (permalink / raw) To: Linus Torvalds; +Cc: Nicolas Pitre, Jakub Narebski, Christopher Jefferson, git On Wed, May 27, 2009 at 03:07:49PM -0700, Linus Torvalds wrote: > I suspect you wouldn't even need to. A regular delta algorithm would just > work fairly well to find the common parts. > > Sure, if the offset of the data changes a lot, then you'd miss all the > deltas between two (large) objects that now have data that traverses > object boundaries, but especially if the split size is pretty large (ie > several tens of MB, possibly something like 256M), that's still going to > be a pretty rare event. I confess that I'm not just interested in the _size_ of the deltas, but also speeding up deltification and rename detection. And I'm interested in files where we can benefit from their semantics a bit. So yes, with some overlap you would end up with pretty reasonable deltas for arbitrary binary, as you describe. But I was thinking something more like splitting a JPEG into a small first chunk that contains EXIF data, and a big secondary chunk that contains the actual image data. The second half is marked as not compressible (since it is already lossily compressed), and not interesting for deltification. When we consider two images for deltification, either: 1. they have the same "uninteresting" big part. In that case, you can trivially make a delta by just replacing the smaller first part (or even finding the optimal delta between the small parts). You never even need to look at the second half. 2. they don't have the same uninteresting part. You can reject them as delta candidates, because there is little chance the big parts will be related, even for a different version of the same image. And that extends to rename detection, as well. You can avoid looking at the big part at all if you assume big parts with differing hashes are going to be drastically different. > > That would make things like rename detection very fast. Of course it has > > the downside that you are cementing whatever split you made into history > > for all time. And it means that two people adding the same content might > > end up with different trees. Both things that git tries to avoid. > > It's the "I can no longer see that the files are the same by comparing > SHA1's" that I personally dislike. Right. I don't think splitting in the git data structure itself is worth it for that reason. But deltification and rename detection keeping a cache of smart splits that says "You can represent <sha-1> as this concatenation of <sha-1>s" means they can still get some advantage (over multiple runs, certainly, but possibly even over a single run: a smart splitter might not even have to look at the entire file contents). > So my "fixed chunk" approach would be nice in that if you have this kind > of "chunkblob" entry, in the tree (and index) it would literally be one > entry, and look like that: > > 100644 chunkblob <sha1> But if I am understanding you correctly, you _are_ proposing to munge the git data structure here. Which means that pre-chunkblob trees will point to the raw blob, and then post-chunkblob trees will point to the chunked representation. And that means not being able to use the sha-1 to see that they eventually point to the same content. 
-Peff
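The two-case heuristic described above can be sketched roughly as follows. This is purely illustrative and not anything git does today: the 64 kB split point, the helper names, and the use of plain SHA-1 over raw bytes are all assumptions.

    import hashlib

    METADATA_BYTES = 64 * 1024  # assumed size of the small "metadata" head


    def head_and_body_digests(path, chunk_size=8 * 1024 * 1024):
        """Hash the small metadata head and the big image body separately."""
        head, body = hashlib.sha1(), hashlib.sha1()
        with open(path, "rb") as f:
            head.update(f.read(METADATA_BYTES))
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                body.update(chunk)
        return head.hexdigest(), body.hexdigest()


    def worth_deltifying(path_a, path_b):
        """Case 1: identical big parts, so a delta only has to cover the heads.
        Case 2: differing big parts, so reject the pair without further work."""
        return head_and_body_digests(path_a)[1] == head_and_body_digests(path_b)[1]

The point of the sketch is only that case 2 lets a delta or rename search discard huge candidate pairs after comparing two cached digests, without rereading the bulk data.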
* Re: Problem with large files on different OSes
  2009-05-28 19:43 ` Jeff King
@ 2009-05-28 19:49   ` Linus Torvalds
  0 siblings, 0 replies; 31+ messages in thread
From: Linus Torvalds @ 2009-05-28 19:49 UTC (permalink / raw)
  To: Jeff King; +Cc: Nicolas Pitre, Jakub Narebski, Christopher Jefferson, git

On Thu, 28 May 2009, Jeff King wrote:
>
> > So my "fixed chunk" approach would be nice in that if you have this kind
> > of "chunkblob" entry, in the tree (and index) it would literally be one
> > entry, and look like that:
> >
> >   100644 chunkblob <sha1>
>
> But if I am understanding you correctly, you _are_ proposing to munge
> the git data structure here. Which means that pre-chunkblob trees will
> point to the raw blob, and then post-chunkblob trees will point to the
> chunked representation. And that means not being able to use the sha-1
> to see that they eventually point to the same content.

Yes. If we were to do this, and people have large chunks, then once you
start using the chunkblob (for lack of a better word) model, you'll see
the same object with two different SHA1's. But it's a one-time (and
one-way - since once it's a chunkblob, older models can't touch it)
thing, it can never cause any long-term confusion.

(We'll end up with something similar if somebody ever breaks SHA-1
enough for us to care - the logical way to handle it is likely to just
accept the SHA512-160 object name "aliases")

		Linus
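For illustration only, a toy version of the fixed-chunk "chunkblob" idea: the single tree entry would point at a manifest listing the ids of fixed-size chunks. No such object type exists in git; the 256 MB chunk size and the plain SHA-1 over raw chunk bytes are assumptions taken from the discussion.

    import hashlib

    CHUNK_SIZE = 256 * 1024 * 1024  # the "pretty large" fixed split size mentioned above


    def chunkblob_manifest(path):
        """Return the list of chunk ids a hypothetical chunkblob would record.

        Each fixed-size chunk is hashed on its own, so an edit near the end
        of a huge file leaves the ids of earlier chunks unchanged, while the
        whole-file SHA-1 is never computed at all. That is exactly why old
        and new trees would end up naming the same content differently.
        """
        ids = []
        with open(path, "rb") as f:
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                ids.append(hashlib.sha1(chunk).hexdigest())
        return ids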
* Re: Problem with large files on different OSes
  2009-05-27 21:53 ` Jeff King
  2009-05-27 22:07   ` Linus Torvalds
@ 2009-05-27 23:29   ` Nicolas Pitre
  2009-05-28 20:00     ` Jeff King
  1 sibling, 1 reply; 31+ messages in thread
From: Nicolas Pitre @ 2009-05-27 23:29 UTC (permalink / raw)
  To: Jeff King; +Cc: Linus Torvalds, Jakub Narebski, Christopher Jefferson, git

On Wed, 27 May 2009, Jeff King wrote:

> On Wed, May 27, 2009 at 01:37:26PM -0400, Nicolas Pitre wrote:
>
> > My idea for handling big files is simply to:
> >
> > 1) Define a new parameter to determine what is considered a big file.
> >
> > 2) Store any file larger than the threshold defined in (1) directly into
> >    a pack of their own at "git add" time.
> >
> > 3) Never attempt to diff nor delta large objects, again according to
> >    (1) above. It is typical for large files not to be deltifiable, and
> >    a diff for files in the thousands of megabytes cannot possibly be
> >    sane.
>
> What about large files that have a short metadata section that may
> change? Versions with only the metadata changed delta well, and with a
> custom diff driver, can produce useful diffs. And I don't think that is
> an impractical or unlikely example; large files can often be tagged
> media.

Sure... but what is the actual data pattern currently used out there?
What does P4 or CVS or SVN do with multiple versions of almost
identical 2GB+ files?

My point is, if the tool people are already using with gigantic
repositories is not bothering with delta compression then we don't lose
much in making git usable with those repositories by doing the same.
And this can be achieved pretty easily with fairly minor changes. Plus,
my proposal doesn't introduce any incompatibility in the git repository
format while not denying possible future enhancements.

For example, it would be non-trivial but still doable to make git work
on data streams instead of buffers. The current code for blob
read/write/delta could be kept for performance, along with another
version in parallel doing the same but with file descriptors and
pread/pwrite for big files.

> Linus' "split into multiple objects" approach means you could perhaps
> split intelligently into metadata and "uninteresting data" sections
> based on the file type. That would make things like rename detection
> very fast. Of course it has the downside that you are cementing whatever
> split you made into history for all time. And it means that two people
> adding the same content might end up with different trees. Both things
> that git tries to avoid.

Exactly. And honestly, I don't think it would be worth trying to do
inexact rename detection for huge files anyway. It is rarely the case
that moving/renaming a movie file needs to change its content in some
way.

> I wonder if it would be useful to make such a split at _read_ time. That
> is, still refer to the sha-1 of the whole content in the tree objects,
> but have a separate cache that says "hash X splits to the concatenation
> of Y,Z". Thus you can always refer to the "pure" object, both as a user,
> and in the code. So we could avoid retrofitting all of the code -- just
> some parts like diff might want to handle an object in multiple
> segments.

Unless there are real world scenarios where diffing (as we know it) two
huge files is a common and useful operation, I don't think we should
even try to consider that problem. What people are doing with huge
files is storing them and retrieving them, so we probably should only
limit ourselves to making those operations work for now.
And to that effect, I don't think it would be wise to introduce
artificial segmentations in the object structure that would make both
the code and the git model more complex. We could just as well limit
the complexity to the code for dealing with blobs without having to
load them all in memory at once, and keep the git repository model
simple.

So if we want to do the real thing and deal with huge blobs, there is
only a small set of operations that need to be considered:

- Creation of new blobs (or "git add") for huge files: can be done
  trivially in chunks. An open issue is whether the SHA1 of the file
  should be computed in a first pass over the file, followed by a
  second pass to deflate it if the object doesn't already exist, or
  whether the SHA1 summing and deflating should be done in a single
  pass, discarding the result if the object turns out to already
  exist. Still trivial to implement either way.

- Checkout of huge files: still trivial to perform if not deltified.
  In the delta case, it _could_ still be quite simple, by recursively
  parsing deltas, if the base objects were not deflated. But again, it
  remains to be seen whether 1) deflating or even 2) deltifying huge
  files is useful in practice with real-world data.

- repack/fetch/pull: In the pack data reuse case, the code is already
  fine as it streams small blocks from the source to the destination.
  Delta compression can be done by using coarse indexing of the source
  object and loading/discarding portions of the source data while the
  target object is processed in a streaming fashion.

Other than that, I don't see what else git could usefully do with huge
files. The above operations (read/write/delta of huge blobs) would need
to be done with a separate set of functions, and a configurable size
threshold would select the regular or the chunked set. Nothing
fundamentally difficult in my mind.


Nicolas
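The first operation above, creating a blob from a huge file without holding it all in memory, boils down to streaming SHA-1 over git's object header plus the contents. A minimal sketch (git itself does this in C and would also deflate and write the object in the same pass; the 8 MB chunk size is arbitrary):

    import hashlib
    import os


    def streamed_blob_id(path, chunk_size=8 * 1024 * 1024):
        """Compute a blob's object id in one streaming pass.

        Git hashes the header "blob <size>" plus a NUL byte, followed by the
        raw file contents, so the id of an arbitrarily large file needs only
        constant memory.
        """
        h = hashlib.sha1()
        h.update(b"blob %d\0" % os.path.getsize(path))
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                h.update(chunk)
        return h.hexdigest()

For files untouched by clean/CRLF filters the result matches what "git hash-object" prints, and when the id already exists in the object store the deflate-and-write step can be skipped, which is the two-pass option mentioned above.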
* Re: Problem with large files on different OSes
  2009-05-27 23:29 ` Nicolas Pitre
@ 2009-05-28 20:00   ` Jeff King
  2009-05-28 20:54     ` Nicolas Pitre
  0 siblings, 1 reply; 31+ messages in thread
From: Jeff King @ 2009-05-28 20:00 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Linus Torvalds, Jakub Narebski, Christopher Jefferson, git

On Wed, May 27, 2009 at 07:29:02PM -0400, Nicolas Pitre wrote:

> > What about large files that have a short metadata section that may
> > change? Versions with only the metadata changed delta well, and with a
> > custom diff driver, can produce useful diffs. And I don't think that is
> > an impractical or unlikely example; large files can often be tagged
> > media.
>
> Sure... but what is the actual data pattern currently used out there?

I'm not sure what you mean by "out there", but I just exactly described
the data pattern of a repo I have (a few thousand 5 megapixel JPEGs and
short (a few dozens of megabytes) AVIs, frequent additions, infrequently
changing photo contents, and moderately changing metadata). I don't know
how that matches other people's needs.

Game designers have been mentioned before in large media checkins, and
I think they focus less on metadata changes. Media is either there to
stay, or it is replaced as a whole.

> What does P4 or CVS or SVN do with multiple versions of almost
> identical 2GB+ files?

I only ever tried this with CVS, which just stored the entire binary
version as a whole. And of course running "diff" was useless, but then
it was also useless on text files. ;) I suspect CVS would simply choke
on a 2G file.

But I don't want to do as well as those other tools. I want to be able
to do all of the useful things git can do but with large files.

> My point is, if the tool people are already using with gigantic
> repositories is not bothering with delta compression then we don't lose
> much in making git usable with those repositories by doing the same.
> And this can be achieved pretty easily with fairly minor changes. Plus,
> my proposal doesn't introduce any incompatibility in the git repository
> format while not denying possible future enhancements.

Right. I think in some ways we are perhaps talking about two different
problems. I am really interested in moderately large files (a few
megabytes up to a few dozen or even a hundred megabytes), but I want
git to be _fast_ at dealing with them, and doing useful operations on
them (like rename detection, diffing, etc).

A smart splitter would probably want to mark part of the split as "this
section is large and uninteresting for compression, deltas, diffing,
and renames". And that half may be stored in the way that you are
proposing (in a separate single-object pack, no compression, no delta,
etc). So in a sense I think what I am talking about would build on top
of what you want to do.

> > very fast. Of course it has the downside that you are cementing whatever
> > split you made into history for all time. And it means that two people
> > adding the same content might end up with different trees. Both things
> > that git tries to avoid.
>
> Exactly. And honestly, I don't think it would be worth trying to do
> inexact rename detection for huge files anyway. It is rarely the case
> that moving/renaming a movie file needs to change its content in some
> way.

I should have been more clear in my other email: I think splitting that
is represented in the actual git trees is not going to be worth the
hassle. But I do think we can get some of the benefits by maintaining a
split cache for viewers.
And again, maybe my use case is crazy, but in my repo I have renames
and metadata content changes together.

> Unless there are real world scenarios where diffing (as we know it) two
> huge files is a common and useful operation, I don't think we should
> even try to consider that problem. What people are doing with huge
> files is storing them and retrieving them, so we probably should only
> limit ourselves to making those operations work for now.

Again, this is motivated by a real use case that I have.

> And to that effect, I don't think it would be wise to introduce
> artificial segmentations in the object structure that would make both
> the code and the git model more complex. We could just as well limit
> the complexity to the code for dealing with blobs without having to
> load them all in memory at once, and keep the git repository model
> simple.

I do agree with this; I don't want to make any changes to the
repository model.

> So if we want to do the real thing and deal with huge blobs, there is
> only a small set of operations that need to be considered:

I think everything you say here is sensible; I just want more
operations for my use case.

-Peff
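A rough sketch of the split cache Jeff describes ("hash X splits to the concatenation of Y,Z"). The JSON file, helper names, and similarity score are invented for illustration and are not an existing git mechanism; since trees still name the whole blob, such a cache would be purely local and could be rebuilt at any time.

    import json
    import os


    def load_split_cache(cache_path):
        """Map a whole-blob id to the list of part ids it splits into."""
        if os.path.exists(cache_path):
            with open(cache_path) as f:
                return json.load(f)
        return {}


    def record_split(cache, whole_id, part_ids):
        """Remember that <whole_id> is the concatenation of <part_ids>."""
        cache[whole_id] = list(part_ids)


    def save_split_cache(cache_path, cache):
        with open(cache_path, "w") as f:
            json.dump(cache, f)


    def cheap_similarity(cache, id_a, id_b):
        """Estimate similarity from the cached part lists alone, so rename
        detection never has to reread the big "uninteresting" halves.
        Returns None if either blob has no cached split, in which case the
        caller falls back to the normal, expensive comparison."""
        parts_a, parts_b = cache.get(id_a), cache.get(id_b)
        if parts_a is None or parts_b is None:
            return None
        shared = len(set(parts_a) & set(parts_b))
        return shared / max(len(parts_a), len(parts_b))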
* Re: Problem with large files on different OSes
  2009-05-28 20:00 ` Jeff King
@ 2009-05-28 20:54   ` Nicolas Pitre
  2009-05-28 21:21     ` Jeff King
  0 siblings, 1 reply; 31+ messages in thread
From: Nicolas Pitre @ 2009-05-28 20:54 UTC (permalink / raw)
  To: Jeff King; +Cc: Linus Torvalds, Jakub Narebski, Christopher Jefferson, git

On Thu, 28 May 2009, Jeff King wrote:

> On Wed, May 27, 2009 at 07:29:02PM -0400, Nicolas Pitre wrote:
>
> > > What about large files that have a short metadata section that may
> > > change? Versions with only the metadata changed delta well, and with a
> > > custom diff driver, can produce useful diffs. And I don't think that is
> > > an impractical or unlikely example; large files can often be tagged
> > > media.
> >
> > Sure... but what is the actual data pattern currently used out there?
>
> I'm not sure what you mean by "out there", but I just exactly described
> the data pattern of a repo I have (a few thousand 5 megapixel JPEGs and
> short (a few dozens of megabytes) AVIs, frequent additions, infrequently
> changing photo contents, and moderately changing metadata). I don't know
> how that matches other people's needs.

How does diffing JPEGs or AVIs à la 'git diff' make sense?

Also, you certainly have little to delta against, as you add new photos
more often than you modify existing ones?

> Game designers have been mentioned before in large media checkins, and
> I think they focus less on metadata changes. Media is either there to
> stay, or it is replaced as a whole.

Right. And my proposal fits that scenario pretty well.

> > What does P4 or CVS or SVN do with multiple versions of almost
> > identical 2GB+ files?
>
> I only ever tried this with CVS, which just stored the entire binary
> version as a whole. And of course running "diff" was useless, but then
> it was also useless on text files. ;) I suspect CVS would simply choke
> on a 2G file.
>
> But I don't want to do as well as those other tools. I want to be able
> to do all of the useful things git can do but with large files.

Right now git simply does much worse, so doing as well is still a
worthy goal.

> Right. I think in some ways we are perhaps talking about two different
> problems. I am really interested in moderately large files (a few
> megabytes up to a few dozen or even a hundred megabytes), but I want
> git to be _fast_ at dealing with them, and doing useful operations on
> them (like rename detection, diffing, etc).

I still can't see how diffing big files is useful. Certainly you'll
need a specialized external diff tool, in which case it is not git's
problem anymore, except for writing content to temporary files.

Rename detection: either you deal with the big files each time, or you
(re)create a cache with that information so no analysis is needed the
second time around. This is something that even small files might
possibly benefit from. But in any case, there is no other way but to
bite the bullet at least initially, and big files will be slower to
process no matter what.

> A smart splitter would probably want to mark part of the split as "this
> section is large and uninteresting for compression, deltas, diffing,
> and renames". And that half may be stored in the way that you are
> proposing (in a separate single-object pack, no compression, no delta,
> etc). So in a sense I think what I am talking about would build on top
> of what you want to do.
It looks to me like you wish for git to do something that a specialized
database would be much better suited for. Aren't there already tools to
gather picture metadata, just like iTunes does with MP3s?

> But I do think we can get some of the benefits by maintaining a split
> cache for viewers.

Sure. But being able to deal with large (1GB and more) files remains a
totally different problem.


Nicolas
* Re: Problem with large files on different OSes
  2009-05-28 20:54 ` Nicolas Pitre
@ 2009-05-28 21:21   ` Jeff King
  0 siblings, 0 replies; 31+ messages in thread
From: Jeff King @ 2009-05-28 21:21 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Linus Torvalds, Jakub Narebski, Christopher Jefferson, git

On Thu, May 28, 2009 at 04:54:28PM -0400, Nicolas Pitre wrote:

> > I'm not sure what you mean by "out there", but I just exactly described
> > the data pattern of a repo I have (a few thousand 5 megapixel JPEGs and
> > short (a few dozens of megabytes) AVIs, frequent additions, infrequently
> > changing photo contents, and moderately changing metadata). I don't know
> > how that matches other people's needs.
>
> How does diffing JPEGs or AVIs à la 'git diff' make sense?

It is useful to see the changes in a text representation of the
metadata, with a single-line mention if the image or movie data has
changed. It's why I wrote the textconv feature.

> Also, you certainly have little to delta against, as you add new photos
> more often than you modify existing ones?

I do add new photos more often than I modify existing ones. But I do
modify the old ones (tag corrections, new tags I didn't think of
initially, updates to the tagging schema, etc), too.

The sum of the sizes for all objects in the repo is 8.3G. The fully
packed repo is 3.3G. So there clearly is some benefit from deltas, and
I don't want to just turn them off. Doing the actual repack is
painfully slow.

> I still can't see how diffing big files is useful. Certainly you'll
> need a specialized external diff tool, in which case it is not git's
> problem anymore, except for writing content to temporary files.

Writing content to temporary files is actually quite slow when the
files are hundreds of megabytes (even "git show" can be painful, let
alone "git log -p"). But that is something that can be dealt with by
improving the interface to external diff and textconv to avoid writing
out the whole file (and is something I have patches in the works for,
but they need to be finished and cleaned up).

> Rename detection: either you deal with the big files each time, or you
> (re)create a cache with that information so no analysis is needed the
> second time around. This is something that even small files might
> possibly benefit from. But in any case, there is no other way but to
> bite the bullet at least initially, and big files will be slower to
> process no matter what.

Right. What I am proposing is basically to create such a cache. But it
is one that is general enough that it could be used for more than just
the rename detection (though arguably rename detection and
deltification could actually share more of the same techniques, in
which case a cache for one would help the other).

> It looks to me like you wish for git to do something that a specialized
> database would be much better suited for. Aren't there already tools to
> gather picture metadata, just like iTunes does with MP3s?

Yes, I already have tools for handling picture metadata. How do I
version control that information? How do I keep it in sync across
multiple checkouts? How do I handle merging concurrent changes from
multiple sources? How do I keep that metadata connected to the pictures
that it describes?

The things I want to do are conceptually no different from what I do
with other files; it's merely the size of the files that makes working
with them in git less convenient (but it does _work_; I am using git
for this _now_, and I have been for a few years).
> But being able to deal with large (1GB and more) files remains a
> totally different problem.

Right, that is why I think I will end up building on top of what you
do. I am trying to make a way for some operations to avoid looking at
the entire file, even streaming it, which should drastically speed up
those operations. But it is unavoidable that some operations (e.g.,
"git add") will have to look at the entire file. And that is what your
proposal is about; streaming is basically the only way forward there.

-Peff
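The textconv mechanism referred to above is a real git feature: a diff driver's textconv command converts a blob to text before diffing, so "git diff" and "git log -p" show metadata changes as ordinary text instead of binary noise. A sketch of such a filter for JPEG metadata, assuming the Pillow package is available; the "jpegmeta" driver name and the script path are made up for the example.

    #!/usr/bin/env python3
    """Illustrative textconv filter: print a JPEG's EXIF tags as text.

    Wiring (standard textconv configuration; the driver name is an assumption):
        echo '*.jpg diff=jpegmeta' >> .gitattributes
        git config diff.jpegmeta.textconv ./jpeg-textconv.py
    """
    import sys

    from PIL import ExifTags, Image  # assumes Pillow is installed

    # git invokes the textconv command with a single argument: the path of
    # the file to convert. Whatever is printed becomes the diff content.
    image = Image.open(sys.argv[1])
    for tag_id, value in sorted(image.getexif().items()):
        name = ExifTags.TAGS.get(tag_id, hex(tag_id))
        print(f"{name}: {value}")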
end of thread, other threads:[~2009-05-28 21:21 UTC | newest]

Thread overview: 31+ messages
2009-05-27 10:52 Problem with large files on different OSes Christopher Jefferson
2009-05-27 11:37 ` Andreas Ericsson
2009-05-27 13:02 ` Christopher Jefferson
2009-05-27 13:28 ` John Tapsell
2009-05-27 13:30 ` Christopher Jefferson
2009-05-27 13:32 ` John Tapsell
2009-05-27 14:01 ` Tomas Carnecky
2009-05-27 14:09 ` Christopher Jefferson
2009-05-27 14:22 ` Andreas Ericsson
2009-05-27 14:37 ` Jakub Narebski
2009-05-27 16:30 ` Linus Torvalds
2009-05-27 16:59 ` Linus Torvalds
2009-05-27 17:22 ` Christopher Jefferson
2009-05-27 17:30 ` Jakub Narebski
2009-05-27 17:37 ` Nicolas Pitre
2009-05-27 21:53 ` Jeff King
2009-05-27 22:07 ` Linus Torvalds
2009-05-27 23:09 ` Alan Manuel Gloria
2009-05-28  1:56 ` Linus Torvalds
2009-05-28  3:26 ` Nicolas Pitre
2009-05-28  4:21 ` Eric Raible
2009-05-28  4:30 ` Shawn O. Pearce
2009-05-28  5:52 ` Eric Raible
2009-05-28  8:52 ` Andreas Ericsson
2009-05-28 17:41 ` Nicolas Pitre
2009-05-28 19:43 ` Jeff King
2009-05-28 19:49 ` Linus Torvalds
2009-05-27 23:29 ` Nicolas Pitre
2009-05-28 20:00 ` Jeff King
2009-05-28 20:54 ` Nicolas Pitre
2009-05-28 21:21 ` Jeff King