* Problem with large files on different OSes
@ 2009-05-27 10:52 Christopher Jefferson
2009-05-27 11:37 ` Andreas Ericsson
` (2 more replies)
0 siblings, 3 replies; 31+ messages in thread
From: Christopher Jefferson @ 2009-05-27 10:52 UTC (permalink / raw)
To: git
I recently came across a very annoying problem, characterised by the
following example:
On a recent Ubuntu install:
dd if=/dev/zero of=file bs=1300k count=1k
git commit file -m "Add huge file"
The repository can be pulled and pushed successfully to other Ubuntu
installs, but on a Mac OS X 10.5.7 machine with 4GB of RAM, git pull
produces:
remote: Counting objects: 6, done.
remote: git(1533,0xb0081000) malloc: *** mmap(size=1363152896) failed (error code=12)
remote: *** error: can't allocate region
remote: *** set a breakpoint in malloc_error_break to debug
remote: git(1533,0xb0081000) malloc: *** mmap(size=1363152896) failed (error code=12)
remote: *** error: can't allocate region
remote: *** set a breakpoint in malloc_error_break to debug
remote: fatal: Out of memory, malloc failed
error: git upload-pack: git-pack-objects died with error.
fatal: git upload-pack: aborting due to possible repository corruption on the remote side.
remote: aborting due to possible repository corruption on the remote side.
fatal: protocol error: bad pack header
The problem appears to be the different maximum mmap sizes available
on different OSes. While I don't really mind the maximum file size
restriction git imposes, having this restriction vary from OS to OS is
very annoying. Fixing this required rewriting history to remove the
commit, which caused problems as the commit had already been pulled,
and built on, by a number of developers.
If the requirement that all files can be mmapped cannot be easily
removed, would it perhaps be acceptable to impose a (soft?) 1GB(ish)
file size limit? I suggest 1GB as all the OSes I can easily get hold
of (FreeBSD, Windows, Mac OS X, Linux) support an mmap of size >
1GB.
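For anyone hit by the same thing, one rough way to spot blobs above a
given size anywhere in history is a scan like the following (a sketch
only; the 1GB threshold and the walk over all refs are illustrative):

#!/bin/sh
# Report blobs in history larger than the (arbitrary) 1GB threshold.
limit=$((1024 * 1024 * 1024))
git rev-list --objects --all |
while read -r sha path; do
    test "$(git cat-file -t "$sha")" = blob || continue
    size=$(git cat-file -s "$sha")
    if [ "$size" -gt "$limit" ]; then
        echo "$size $sha $path"
    fi
done

Running something like this before publishing a branch makes it much
cheaper to catch an oversized commit while rewriting it is still
painless.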
^ permalink raw reply [flat|nested] 31+ messages in thread* Re: Problem with large files on different OSes 2009-05-27 10:52 Problem with large files on different OSes Christopher Jefferson @ 2009-05-27 11:37 ` Andreas Ericsson 2009-05-27 13:02 ` Christopher Jefferson 2009-05-27 13:28 ` John Tapsell 2009-05-27 14:01 ` Tomas Carnecky 2009-05-27 14:37 ` Jakub Narebski 2 siblings, 2 replies; 31+ messages in thread From: Andreas Ericsson @ 2009-05-27 11:37 UTC (permalink / raw) To: Christopher Jefferson; +Cc: git Christopher Jefferson wrote: > I recently came across a very annoying problem, characterised by the > following example: > > On a recent ubuntu install: > > dd if=/dev/zero of=file bs=1300k count=1k > git commit file -m "Add huge file" > > > The repository can be pulled and pushed successfully to other ubuntu > installs, but on Mac OS X, 10.5.7 machine with 4GB ram git pull produces: > > remote: Counting objects: 6, done. > remote: git(1533,0xb0081000) malloc: *** mmap(size=1363152896) failed > (error code=12) > remote: *** error: can't allocate region > remote: *** set a breakpoint in malloc_error_break to debug > remote: git(1533,0xb0081000) malloc: *** mmap(size=1363152896) failed > (error code=12) > remote: *** error: can't allocate region > remote: *** set a breakpoint in malloc_error_break to debug > remote: fatal: Out of memory, malloc failed > error: git upload-pack: git-pack-objects died with error. > fatal: git upload-pack: aborting due to possible repository corruption > on the remote side. > remote: aborting due to possible repository corruption on the remote side. > fatal: protocol error: bad pack header > > > The problem appears to be the different maximum mmap sizes available on > different OSes. Whic I don't really mind the maximum file size > restriction git imposes, this restriction varying from OS to OS is very > annoying, fixing this required rewriting history to remove the commit, > which caused problems as the commit had already been pulled, and built > on, by a number of developers. > > If the requirement that all files can be mmapped cannot be easily > removed, would be it perhaps be acceptable to impose a (soft?) 1GB(ish) > file size limit? Most definitely not. Why should we limit a cross-platform system for the benefit of one particular developer's lacking hardware? Such a convention should, if anything, be enforced by social policy, but not by the tool itself. Otherwise, why not just restrict the tool that created the huge file so that it makes smaller files that fit into git on all platforms instead? (No, that wasn't a real suggestion. It was just to make the point that your suggestion for git to impose artificial limits is equally ludicrous) -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 Register now for Nordic Meet on Nagios, June 3-4 in Stockholm http://nordicmeetonnagios.op5.org/ Considering the successes of the wars on alcohol, poverty, drugs and terror, I think we should give some serious thought to declaring war on peace. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-27 11:37 ` Andreas Ericsson @ 2009-05-27 13:02 ` Christopher Jefferson 2009-05-27 13:28 ` John Tapsell 1 sibling, 0 replies; 31+ messages in thread From: Christopher Jefferson @ 2009-05-27 13:02 UTC (permalink / raw) To: Andreas Ericsson; +Cc: git On 27 May 2009, at 12:37, Andreas Ericsson wrote: > Christopher Jefferson wrote: >> I recently came across a very annoying problem, characterised by >> the following example: >> On a recent ubuntu install: >> dd if=/dev/zero of=file bs=1300k count=1k >> git commit file -m "Add huge file" >> The repository can be pulled and pushed successfully to other >> ubuntu installs, but on Mac OS X, 10.5.7 machine with 4GB ram git >> pull produces: >> remote: Counting objects: 6, done. >> remote: git(1533,0xb0081000) malloc: *** mmap(size=1363152896) >> failed (error code=12) >> remote: *** error: can't allocate region >> remote: *** set a breakpoint in malloc_error_break to debug >> remote: git(1533,0xb0081000) malloc: *** mmap(size=1363152896) >> failed (error code=12) >> remote: *** error: can't allocate region >> remote: *** set a breakpoint in malloc_error_break to debug >> remote: fatal: Out of memory, malloc failed >> error: git upload-pack: git-pack-objects died with error. >> fatal: git upload-pack: aborting due to possible repository >> corruption on the remote side. >> remote: aborting due to possible repository corruption on the >> remote side. >> fatal: protocol error: bad pack header >> The problem appears to be the different maximum mmap sizes >> available on different OSes. Whic I don't really mind the maximum >> file size restriction git imposes, this restriction varying from OS >> to OS is very annoying, fixing this required rewriting history to >> remove the commit, which caused problems as the commit had already >> been pulled, and built on, by a number of developers. >> If the requirement that all files can be mmapped cannot be easily >> removed, would be it perhaps be acceptable to impose a (soft?) >> 1GB(ish) file size limit? > > Most definitely not. Why should we limit a cross-platform system for > the benefit of one particular developer's lacking hardware? Out of curiosity, why do you say lacking hardware? I am running ubuntu, windows and Mac OS X on exactly the same machine, which is not running out of physical memory, never mind swap, when using git on any OS. The problem is purely a software (and OS) problem. Chris ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-27 11:37 ` Andreas Ericsson 2009-05-27 13:02 ` Christopher Jefferson @ 2009-05-27 13:28 ` John Tapsell 2009-05-27 13:30 ` Christopher Jefferson 1 sibling, 1 reply; 31+ messages in thread From: John Tapsell @ 2009-05-27 13:28 UTC (permalink / raw) To: Andreas Ericsson; +Cc: Christopher Jefferson, git 2009/5/27 Andreas Ericsson <ae@op5.se>: > Christopher Jefferson wrote: >> If the requirement that all files can be mmapped cannot be easily removed, >> would be it perhaps be acceptable to impose a (soft?) 1GB(ish) file size >> limit? > > Most definitely not. Why should we limit a cross-platform system for > the benefit of one particular developer's lacking hardware? Perhaps a simple warning would suffice "Warning: Files larger than 2GB may cause problems when trying to checkout on Windows." John ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-27 13:28 ` John Tapsell @ 2009-05-27 13:30 ` Christopher Jefferson 2009-05-27 13:32 ` John Tapsell 0 siblings, 1 reply; 31+ messages in thread From: Christopher Jefferson @ 2009-05-27 13:30 UTC (permalink / raw) To: John Tapsell; +Cc: Andreas Ericsson, git On 27 May 2009, at 14:28, John Tapsell wrote: > 2009/5/27 Andreas Ericsson <ae@op5.se>: >> Christopher Jefferson wrote: >>> If the requirement that all files can be mmapped cannot be easily >>> removed, >>> would be it perhaps be acceptable to impose a (soft?) 1GB(ish) >>> file size >>> limit? >> >> Most definitely not. Why should we limit a cross-platform system for >> the benefit of one particular developer's lacking hardware? > > Perhaps a simple warning would suffice "Warning: Files larger than > 2GB may cause problems when trying to checkout on Windows." > Something like that, except that limit seems to be only 1.3GB on Mac OS X Chris ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-27 13:30 ` Christopher Jefferson @ 2009-05-27 13:32 ` John Tapsell 0 siblings, 0 replies; 31+ messages in thread From: John Tapsell @ 2009-05-27 13:32 UTC (permalink / raw) To: Christopher Jefferson; +Cc: Andreas Ericsson, git 2009/5/27 Christopher Jefferson <caj@cs.st-andrews.ac.uk>: > Something like that, except that limit seems to be only 1.3GB on Mac OS X Does linux have a similar limitation, lower than the limit imposed by the filesystem? Could this be solved by having a fallback solution for mmap? (switching to opening the file normally) Or would this fallback be too intrusive/large of a change? John ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-27 10:52 Problem with large files on different OSes Christopher Jefferson 2009-05-27 11:37 ` Andreas Ericsson @ 2009-05-27 14:01 ` Tomas Carnecky 2009-05-27 14:09 ` Christopher Jefferson 2009-05-27 14:37 ` Jakub Narebski 2 siblings, 1 reply; 31+ messages in thread From: Tomas Carnecky @ 2009-05-27 14:01 UTC (permalink / raw) To: Christopher Jefferson; +Cc: git On May 27, 2009, at 12:52 PM, Christopher Jefferson wrote: > I recently came across a very annoying problem, characterised by the > following example: > > On a recent ubuntu install: > > dd if=/dev/zero of=file bs=1300k count=1k > git commit file -m "Add huge file" > > > The repository can be pulled and pushed successfully to other ubuntu > installs, but on Mac OS X, 10.5.7 machine with 4GB ram git pull > produces: > > remote: Counting objects: 6, done. > remote: git(1533,0xb0081000) malloc: *** mmap(size=1363152896) > failed (error code=12) > remote: *** error: can't allocate region > remote: *** set a breakpoint in malloc_error_break to debug > remote: git(1533,0xb0081000) malloc: *** mmap(size=1363152896) > failed (error code=12) > remote: *** error: can't allocate region > remote: *** set a breakpoint in malloc_error_break to debug > remote: fatal: Out of memory, malloc failed > error: git upload-pack: git-pack-objects died with error. > fatal: git upload-pack: aborting due to possible repository > corruption on the remote side. > remote: aborting due to possible repository corruption on the remote > side. > fatal: protocol error: bad pack header > > > The problem appears to be the different maximum mmap sizes available > on different OSes. Whic I don't really mind the maximum file size > restriction git imposes, this restriction varying from OS to OS is > very annoying, fixing this required rewriting history to remove the > commit, which caused problems as the commit had already been pulled, > and built on, by a number of developers. > > If the requirement that all files can be mmapped cannot be easily > removed, would be it perhaps be acceptable to impose a (soft?) > 1GB(ish) file size limit? I suggest 1GB as all the OSes I can get > hold of easily (freeBSD, windows, Mac OS X, linux) support a mmap of > size > 1GB. I think this is a limitation of a 32bit build of git. I just tried with a 64bit build and it added the file just fine. The compiler on MacOSX (gcc) produces 32bit builds by default, even if the system supports 64bit executables. But gcc on 64bit Linux (at least the installations I have at home) produces a 64bit executables by default. Solaris/OpenSolaris behaves like MacOSX, no idea about *BSD or Windows. Maybe this is why git works on Linux but not MacOSX even on the same hardware. Btw, I built git with: make install prefix=... CC="gcc -m64", no modifications needed (MacOSX 10.5.7). tom ^ permalink raw reply [flat|nested] 31+ messages in thread
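For what it's worth, such a build can be produced and checked in place
with something like the following (the prefix is only an example):

make prefix=$HOME/git64 CC="gcc -m64" install
file $HOME/git64/bin/git    # should report a 64-bit (x86_64) executable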
* Re: Problem with large files on different OSes 2009-05-27 14:01 ` Tomas Carnecky @ 2009-05-27 14:09 ` Christopher Jefferson 2009-05-27 14:22 ` Andreas Ericsson 0 siblings, 1 reply; 31+ messages in thread From: Christopher Jefferson @ 2009-05-27 14:09 UTC (permalink / raw) To: Tomas Carnecky; +Cc: git On 27 May 2009, at 15:01, Tomas Carnecky wrote: >> >> The problem appears to be the different maximum mmap sizes >> available on different OSes. Whic I don't really mind the maximum >> file size restriction git imposes, this restriction varying from OS >> to OS is very annoying, fixing this required rewriting history to >> remove the commit, which caused problems as the commit had already >> been pulled, and built on, by a number of developers. >> >> If the requirement that all files can be mmapped cannot be easily >> removed, would be it perhaps be acceptable to impose a (soft?) >> 1GB(ish) file size limit? I suggest 1GB as all the OSes I can get >> hold of easily (freeBSD, windows, Mac OS X, linux) support a mmap >> of size > 1GB. > > I think this is a limitation of a 32bit build of git. I just tried > with a 64bit build and it added the file just fine. The compiler on > MacOSX (gcc) produces 32bit builds by default, even if the system > supports 64bit executables. But gcc on 64bit Linux (at least the > installations I have at home) produces a 64bit executables by > default. Solaris/OpenSolaris behaves like MacOSX, no idea about *BSD > or Windows. Maybe this is why git works on Linux but not MacOSX even > on the same hardware. > Btw, I built git with: make install prefix=... CC="gcc -m64", no > modifications needed (MacOSX 10.5.7). The git installs I am using are all 32bit, this machine doesn't have a 64bit processor (it is one of the few macs released without one). It's nice to know long term this problem will go away, that all suggests introducing some limit is not approriate, as while 32bit users have some arbitary limit above which they cannot go, I am sure all 64-bit OSes will manage to easily mmap any file. Of course warning such users they are producing packs that are not going to work on 32bit compiles of git isn't a stupid idea. Chris ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-27 14:09 ` Christopher Jefferson @ 2009-05-27 14:22 ` Andreas Ericsson 0 siblings, 0 replies; 31+ messages in thread From: Andreas Ericsson @ 2009-05-27 14:22 UTC (permalink / raw) To: Christopher Jefferson; +Cc: Tomas Carnecky, git Christopher Jefferson wrote: > > On 27 May 2009, at 15:01, Tomas Carnecky wrote: > >>> >>> The problem appears to be the different maximum mmap sizes available >>> on different OSes. Whic I don't really mind the maximum file size >>> restriction git imposes, this restriction varying from OS to OS is >>> very annoying, fixing this required rewriting history to remove the >>> commit, which caused problems as the commit had already been pulled, >>> and built on, by a number of developers. >>> >>> If the requirement that all files can be mmapped cannot be easily >>> removed, would be it perhaps be acceptable to impose a (soft?) >>> 1GB(ish) file size limit? I suggest 1GB as all the OSes I can get >>> hold of easily (freeBSD, windows, Mac OS X, linux) support a mmap of >>> size > 1GB. >> >> I think this is a limitation of a 32bit build of git. I just tried >> with a 64bit build and it added the file just fine. The compiler on >> MacOSX (gcc) produces 32bit builds by default, even if the system >> supports 64bit executables. But gcc on 64bit Linux (at least the >> installations I have at home) produces a 64bit executables by default. >> Solaris/OpenSolaris behaves like MacOSX, no idea about *BSD or >> Windows. Maybe this is why git works on Linux but not MacOSX even on >> the same hardware. >> Btw, I built git with: make install prefix=... CC="gcc -m64", no >> modifications needed (MacOSX 10.5.7). > > The git installs I am using are all 32bit, this machine doesn't have a > 64bit processor (it is one of the few macs released without one). It's > nice to know long term this problem will go away, that all suggests > introducing some limit is not approriate, as while 32bit users have some > arbitary limit above which they cannot go, I am sure all 64-bit OSes > will manage to easily mmap any file. Of course warning such users they > are producing packs that are not going to work on 32bit compiles of git > isn't a stupid idea. > mmap()'ing large files (> 4GB) work just fine on Linux. You can't mmap() more than 4GB at a time though (I think; I didn't try), but since we don't do that anyway I doubt that was the problem. The file you produced with your dd command should have ended up being 1239MB, or 1.21GB, so the real hard limit for MacOSX seem to be 1GB if, indeed, there is one. On the other hand, the error message you got ("fatal: Out of memory, malloc failed") seems to indicate the system actually had no memory left when you tried to garbage-collect your repository. Are you using a dual-core system? If so, please try again with pack.threads = 1 set in the .git/config file of that particular repository. Each thread will allocate roughly the same amount of memory, so if both of them had to handle that huge blob at the same time, they'd have exploded memory usage up to 1.3GB + the compressed size of them + DAG-bookkeeping etc etc. I'm guessing we'd have seen error reports from other OSX users if it was actually impossible to mmap() 1GB files in git on OSX. 
-- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 Register now for Nordic Meet on Nagios, June 3-4 in Stockholm http://nordicmeetonnagios.op5.org/ Considering the successes of the wars on alcohol, poverty, drugs and terror, I think we should give some serious thought to declaring war on peace. ^ permalink raw reply [flat|nested] 31+ messages in thread
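The single-thread test suggested above is one command per repository
(it simply writes "[pack] threads = 1" into that repository's
.git/config):

git config pack.threads 1

Setting it back to 0 afterwards restores the default of auto-detecting
the number of CPUs.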
* Re: Problem with large files on different OSes 2009-05-27 10:52 Problem with large files on different OSes Christopher Jefferson 2009-05-27 11:37 ` Andreas Ericsson 2009-05-27 14:01 ` Tomas Carnecky @ 2009-05-27 14:37 ` Jakub Narebski 2009-05-27 16:30 ` Linus Torvalds 2 siblings, 1 reply; 31+ messages in thread From: Jakub Narebski @ 2009-05-27 14:37 UTC (permalink / raw) To: Christopher Jefferson; +Cc: git Christopher Jefferson <caj@cs.st-andrews.ac.uk> writes: > I recently came across a very annoying problem, characterised by the > following example: > > On a recent ubuntu install: > > dd if=/dev/zero of=file bs=1300k count=1k > git commit file -m "Add huge file" > > > The repository can be pulled and pushed successfully to other ubuntu > installs, but on Mac OS X, 10.5.7 machine with 4GB ram git pull > produces: Do seting `pack.packSizeLimit`, or adjusting values of `core.packedGitWindowSize` and/or `core.packedGitLimit` (see git-config(1)) help in your situation? -- Jakub Narebski Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 31+ messages in thread
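Concretely, those knobs can be tried with something like the following
(the values are only illustrative starting points; see git-config(1)
for their exact meaning):

git config pack.packSizeLimit 512m
git config core.packedGitWindowSize 32m
git config core.packedGitLimit 256m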
* Re: Problem with large files on different OSes 2009-05-27 14:37 ` Jakub Narebski @ 2009-05-27 16:30 ` Linus Torvalds 2009-05-27 16:59 ` Linus Torvalds 0 siblings, 1 reply; 31+ messages in thread From: Linus Torvalds @ 2009-05-27 16:30 UTC (permalink / raw) To: Jakub Narebski; +Cc: Christopher Jefferson, git On Wed, 27 May 2009, Jakub Narebski wrote: > > Do seting `pack.packSizeLimit`, or adjusting values of > `core.packedGitWindowSize` and/or `core.packedGitLimit` > (see git-config(1)) help in your situation? No, that will help just the packfile mmap (and even there, it won't help with things like index file size - we'll always mmap the whole index file). It's definitely worth doing, though - but I think we already default to 32MB pack-file windows on 32-bit architectures. Individual files we always handle in one go. It's what git was designed for, after all - fairly small files. And so git is limited to files smaller than the virtual address space. On a 32-bit setup, that often limits you to roughly a gigabyte. You have 4GB of virtual address space, of which one or two is used for the OS kernel. So say you have 2GB for user mode - you then have the executable mapping and libraries and stack, all spread out in that 2GB virtual address space. In fact, even if it's 3GB for user (I don't know what OS X does), getting one contiguous area may well be limited to ~1GB depending on layout of shared library mappings etc VM fragmentation. Older Linux systems tended to map things in ways that made it hard to get more than 1GB of contiguous data mapping if you compiled with dynamic libraries. 64-bit mostly makes this a non-issue. In fact, if you do "diff", you're going to be even _more_ limited, since for simplicity, our version of xdiff really wants both sources in memory at a time. So you can't really diff >500MB files. Not that you generally want to, of course. I'll see if I can make us handle the "big file without diff" case better by chunking. Linus ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-27 16:30 ` Linus Torvalds @ 2009-05-27 16:59 ` Linus Torvalds 2009-05-27 17:22 ` Christopher Jefferson 2009-05-27 17:37 ` Nicolas Pitre 0 siblings, 2 replies; 31+ messages in thread From: Linus Torvalds @ 2009-05-27 16:59 UTC (permalink / raw) To: Jakub Narebski; +Cc: Christopher Jefferson, git On Wed, 27 May 2009, Linus Torvalds wrote: > > I'll see if I can make us handle the "big file without diff" case better > by chunking. Hmm. No. Looking at it some more, we could add some nasty code to do _some_ things chunked (like adding a new file as a single object), but it doesn't really help. For any kind of useful thing, we'd need to handle the "read from pack" case in multiple chunks too, and that gets really nasty really quickly. The whole "each object as one allocation" design is pretty core, and it looks pointless to have a few special cases, when any actual relevant use would need a whole lot more than the few simple ones. Git really doesn't like big individual objects. I've occasionally thought about handling big files as multiple big objects: we'd split them into a "pseudo-directory" (it would have some new object ID), and then treat them as a magical special kind of directory that just happens to be represented as one large file on the filesystem. That would mean that if you have a huge file, git internally would never think of it as one big file, but as a collection of many smaller objects. By just making the point where you break up files be a consistent rule ("always break into 256MB pieces"), it would be a well-behaved design (ie things like behaviour convergence wrt the same big file being created different ways). HOWEVER. While that would fit in the git design (ie it would be just a fairly straightforward extension - another level of indirection, kind of the way we added subprojects), it would still be a rewrite of some core stuff. The actual number of lines might not be too horrid, but quite frankly, I wouldn't want to do it personally. It would be a lot of work with lots of careful special case handling - and no real upside for normal use. So I'm kind of down on it. I would suggest just admitting that git isn't very good at big individual files - especially not if you have a limited address space. So "don't do it then" or "make sure you are 64-bit and have lots of memory if you do it" may well be the right solution. [ And it's really really sad how Apple migrated to x86-32. It was totally unforgivably stupid, and I said so at the time. When Apple did the PowerPC -> x86 transition, they should have just transitioned to x86-64, and never had a 32-bit space. But Apple does stupid things, that seem to be driven by marketing rather than thinking deeply about the technology, and now they basically _have_ to default to that 32-bit environment. ] Oh well. Linus ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-27 16:59 ` Linus Torvalds @ 2009-05-27 17:22 ` Christopher Jefferson 2009-05-27 17:30 ` Jakub Narebski 2009-05-27 17:37 ` Nicolas Pitre 1 sibling, 1 reply; 31+ messages in thread From: Christopher Jefferson @ 2009-05-27 17:22 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jakub Narebski, git On 27 May 2009, at 17:59, Linus Torvalds wrote: > > > On Wed, 27 May 2009, Linus Torvalds wrote: >> >> I'll see if I can make us handle the "big file without diff" case >> better >> by chunking. > So "don't do it then" or "make sure you are 64-bit and have lots of > memory if you do it" may well be the right solution. Thank you for that description of the problem, I can see how hard it is. Perhaps it might be useful to think about how to codify "don't do it then" in a reasonably simple, automatic way? I've been trying to write a pre-commit hook (I think that's the right place?) which would refuse commits larger than some file size (512MB as a random number I decided), but am having trouble getting it to work right, and generally. Would such a thing be easy, and would that be the right place to put it? While I wouldn't suggest this become default, providing such a hook, and describing why you might want to use it, would seem to avoid the accidental part of the problem. Of course, people should really notice that they are submitting large files, but it's easy(ish) to commit some output file from a program, without realising the file ended up being the wrong side of 1GB. Chris ^ permalink raw reply [flat|nested] 31+ messages in thread
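A minimal sketch of such a hook, saved as .git/hooks/pre-commit and
made executable, might look like this (the 512MB threshold is the
arbitrary figure from the message above, and paths containing
whitespace are not handled):

#!/bin/sh
# Refuse the commit if any staged blob is larger than the limit.
limit=$((512 * 1024 * 1024))
fail=0
for path in $(git diff --cached --name-only --diff-filter=AM); do
    # ":$path" names the blob staged in the index for that path.
    size=$(git cat-file -s ":$path" 2>/dev/null) || continue
    if [ "$size" -gt "$limit" ]; then
        echo >&2 "pre-commit: '$path' is $size bytes, over the $limit byte limit"
        fail=1
    fi
done
exit $fail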
* Re: Problem with large files on different OSes 2009-05-27 17:22 ` Christopher Jefferson @ 2009-05-27 17:30 ` Jakub Narebski 0 siblings, 0 replies; 31+ messages in thread From: Jakub Narebski @ 2009-05-27 17:30 UTC (permalink / raw) To: Christopher Jefferson; +Cc: Linus Torvalds, git On Wed, 27 May 2009, Christopher Jefferson wrote: > On 27 May 2009, at 17:59, Linus Torvalds wrote: >> On Wed, 27 May 2009, Linus Torvalds wrote: >>> >>> I'll see if I can make us handle the "big file without diff" case >>> better by chunking. >> So "don't do it then" or "make sure you are 64-bit and have lots of >> memory if you do it" may well be the right solution. > > Thank you for that description of the problem, I can see how hard it is. > > Perhaps it might be useful to think about how to codify "don't do it > then" in a reasonably simple, automatic way? > > I've been trying to write a pre-commit hook (I think that's the right > place?) which would refuse commits larger than some file size (512MB > as a random number I decided), but am having trouble getting it to > work right, and generally. Would such a thing be easy, and would that > be the right place to put it? > > While I wouldn't suggest this become default, providing such a hook, > and describing why you might want to use it, would seem to avoid the > accidental part of the problem. Hmmm... this is another issue (beside checking for portability of filenames) that would be neatly solved if there was 'pre-add' hook, rather than trying to use 'pre-commit' hook for that. It should not, I think, be that hard to add it... -- Jakub Narebski Poland ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-27 16:59 ` Linus Torvalds 2009-05-27 17:22 ` Christopher Jefferson @ 2009-05-27 17:37 ` Nicolas Pitre 2009-05-27 21:53 ` Jeff King 1 sibling, 1 reply; 31+ messages in thread From: Nicolas Pitre @ 2009-05-27 17:37 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jakub Narebski, Christopher Jefferson, git On Wed, 27 May 2009, Linus Torvalds wrote: > Hmm. No. Looking at it some more, we could add some nasty code to do > _some_ things chunked (like adding a new file as a single object), but it > doesn't really help. For any kind of useful thing, we'd need to handle the > "read from pack" case in multiple chunks too, and that gets really nasty > really quickly. > > The whole "each object as one allocation" design is pretty core, and it > looks pointless to have a few special cases, when any actual relevant use > would need a whole lot more than the few simple ones. > > Git really doesn't like big individual objects. > > I've occasionally thought about handling big files as multiple big > objects: we'd split them into a "pseudo-directory" (it would have some new > object ID), and then treat them as a magical special kind of directory > that just happens to be represented as one large file on the filesystem. > > That would mean that if you have a huge file, git internally would never > think of it as one big file, but as a collection of many smaller objects. > By just making the point where you break up files be a consistent rule > ("always break into 256MB pieces"), it would be a well-behaved design (ie > things like behaviour convergence wrt the same big file being created > different ways). > > HOWEVER. > > While that would fit in the git design (ie it would be just a fairly > straightforward extension - another level of indirection, kind of the way > we added subprojects), it would still be a rewrite of some core stuff. The > actual number of lines might not be too horrid, but quite frankly, I > wouldn't want to do it personally. It would be a lot of work with lots of > careful special case handling - and no real upside for normal use. My idea for handling big files is simply to: 1) Define a new parameter to determine what is considered a big file. 2) Store any file larger than the treshold defined in (1) directly into a pack of their own at "git add" time. 3) Never attempt to diff nor delta large objects, again according to (1) above. It is typical for large files not to be deltifiable, and a diff for files in the thousands of megabytes cannot possibly be sane. The idea is to avoid ever needing to load such object's content entirely in memory. So with the data already in a pack, the pack data reuse logic (which already does its copy in chunks) could be triggered during a repack/fetch/push. This is also quite trivial to implement with very few special cases, and then git would handle huge repositories with lots of huge files just as well as any other SCMs. The usual git repository compactness won't be there of course, but I doubt people dealing with repositories in the hundreds of gigabytes really care. Nicolas ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-27 17:37 ` Nicolas Pitre @ 2009-05-27 21:53 ` Jeff King 2009-05-27 22:07 ` Linus Torvalds 2009-05-27 23:29 ` Nicolas Pitre 0 siblings, 2 replies; 31+ messages in thread From: Jeff King @ 2009-05-27 21:53 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Linus Torvalds, Jakub Narebski, Christopher Jefferson, git On Wed, May 27, 2009 at 01:37:26PM -0400, Nicolas Pitre wrote: > My idea for handling big files is simply to: > > 1) Define a new parameter to determine what is considered a big file. > > 2) Store any file larger than the treshold defined in (1) directly into > a pack of their own at "git add" time. > > 3) Never attempt to diff nor delta large objects, again according to > (1) above. It is typical for large files not to be deltifiable, and > a diff for files in the thousands of megabytes cannot possibly be > sane. What about large files that have a short metadata section that may change? Versions with only the metadata changed delta well, and with a custom diff driver, can produce useful diffs. And I don't think that is an impractical or unlikely example; large files can often be tagged media. Linus' "split into multiple objects" approach means you could perhaps split intelligently into metadata and "uninteresting data" sections based on the file type. That would make things like rename detection very fast. Of course it has the downside that you are cementing whatever split you made into history for all time. And it means that two people adding the same content might end up with different trees. Both things that git tries to avoid. I wonder if it would be useful to make such a split at _read_ time. That is, still refer to the sha-1 of the whole content in the tree objects, but have a separate cache that says "hash X splits to the concatenation of Y,Z". Thus you can always refer to the "pure" object, both as a user, and in the code. So we could avoid retrofitting all of the code -- just some parts like diff might want to handle an object in multiple segments. -Peff ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-27 21:53 ` Jeff King @ 2009-05-27 22:07 ` Linus Torvalds 2009-05-27 23:09 ` Alan Manuel Gloria 2009-05-28 19:43 ` Jeff King 2009-05-27 23:29 ` Nicolas Pitre 1 sibling, 2 replies; 31+ messages in thread From: Linus Torvalds @ 2009-05-27 22:07 UTC (permalink / raw) To: Jeff King; +Cc: Nicolas Pitre, Jakub Narebski, Christopher Jefferson, git On Wed, 27 May 2009, Jeff King wrote: > > Linus' "split into multiple objects" approach means you could perhaps > split intelligently into metadata and "uninteresting data" sections > based on the file type. I suspect you wouldn't even need to. A regular delta algorithm would just work fairly well to find the common parts. Sure, if the offset of the data changes a lot, then you'd miss all the deltas between two (large) objects that now have data that traverses object boundaries, but especially if the split size is pretty large (ie several tens of MB, possibly something like 256M), that's still going to be a pretty rare event. IOW, imagine that you have a big file that is 2GB in size, and you prepend 100kB of data to it (that's why it's so big - you keep prepending data to it as some kind of odd ChangeLog file). What happens? It would still delta fairly well, even if the delta's would now be: - 100kB of new data - 256M - 100kB of old data as a small delta entry and the _next_ chunk woul be: - 100kB of "new" data (old data from the previous chunk) - 256M - 100kB of old data as a small delta entry .. and so on for each chunk. So if the whole file is 2GB, it would be roughly 8 256MB chunks, and it would delta perfectly well: except for the overlap, that would now be 8x 100kB "slop" deltas. So even a totally unmodified delta algorithm would shrink down the two copies of a ~2GB file to one copy + 900kB of extra delta. Sure, a perfect xdelta thing that would have treated it as one huge file would have had just 100kB of delta data, but 900kB would still be a *big* saving over duplicating the whole 2GB. > That would make things like rename detection very fast. Of course it has > the downside that you are cementing whatever split you made into history > for all time. And it means that two people adding the same content might > end up with different trees. Both things that git tries to avoid. It's the "I can no longer see that the files are the same by comparing SHA1's" that I personally dislike. So my "fixed chunk" approach would be nice in that if you have this kind of "chunkblob" entry, in the tree (and index) it would literally be one entry, and look like that: 100644 chunkblob <sha1> so you could compare two trees that have the same chunkblob entry, and just see that they are the same without ever looking at the (humongous) data. The <chunkblob> type itself would then look like just an array of SHA1's, ie it would literally be an object that only points to other blobs. Kind of a "simplified tree object", if you will. I think it would fit very well in the git model. But it's a nontrivial amount of changes. Linus ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-27 22:07 ` Linus Torvalds @ 2009-05-27 23:09 ` Alan Manuel Gloria 2009-05-28 1:56 ` Linus Torvalds 2009-05-28 19:43 ` Jeff King 1 sibling, 1 reply; 31+ messages in thread From: Alan Manuel Gloria @ 2009-05-27 23:09 UTC (permalink / raw) To: Linus Torvalds Cc: Jeff King, Nicolas Pitre, Jakub Narebski, Christopher Jefferson, git On Thu, May 28, 2009 at 6:07 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Wed, 27 May 2009, Jeff King wrote: >> >> Linus' "split into multiple objects" approach means you could perhaps >> split intelligently into metadata and "uninteresting data" sections >> based on the file type. > > I suspect you wouldn't even need to. A regular delta algorithm would just > work fairly well to find the common parts. > > Sure, if the offset of the data changes a lot, then you'd miss all the > deltas between two (large) objects that now have data that traverses > object boundaries, but especially if the split size is pretty large (ie > several tens of MB, possibly something like 256M), that's still going to > be a pretty rare event. > > IOW, imagine that you have a big file that is 2GB in size, and you prepend > 100kB of data to it (that's why it's so big - you keep prepending data to > it as some kind of odd ChangeLog file). What happens? It would still delta > fairly well, even if the delta's would now be: > > - 100kB of new data > - 256M - 100kB of old data as a small delta entry > > and the _next_ chunk woul be: > > - 100kB of "new" data (old data from the previous chunk) > - 256M - 100kB of old data as a small delta entry > > .. and so on for each chunk. So if the whole file is 2GB, it would be > roughly 8 256MB chunks, and it would delta perfectly well: except for the > overlap, that would now be 8x 100kB "slop" deltas. > > So even a totally unmodified delta algorithm would shrink down the two > copies of a ~2GB file to one copy + 900kB of extra delta. > > Sure, a perfect xdelta thing that would have treated it as one huge file > would have had just 100kB of delta data, but 900kB would still be a *big* > saving over duplicating the whole 2GB. > >> That would make things like rename detection very fast. Of course it has >> the downside that you are cementing whatever split you made into history >> for all time. And it means that two people adding the same content might >> end up with different trees. Both things that git tries to avoid. > > It's the "I can no longer see that the files are the same by comparing > SHA1's" that I personally dislike. > > So my "fixed chunk" approach would be nice in that if you have this kind > of "chunkblob" entry, in the tree (and index) it would literally be one > entry, and look like that: > > 100644 chunkblob <sha1> > > so you could compare two trees that have the same chunkblob entry, and > just see that they are the same without ever looking at the (humongous) > data. > > The <chunkblob> type itself would then look like just an array of SHA1's, > ie it would literally be an object that only points to other blobs. Kind > of a "simplified tree object", if you will. > > I think it would fit very well in the git model. But it's a nontrivial > amount of changes. > > Linus I'd like to pitch in that our mother company uses Subversion, and they consistently push very large binaries onto their Subvesion repositories (I know it's not a good idea. They do it nevertheless. 
The very large binary is a description of a design in a proprietary format by a proprietary tool; they don't want to keep running that tool because of licensing etc issues, so they archive it on Subversion). I'm trying to convince the mother company to switch to git, mostly because our company (the daughter company) doesn't have direct access to their Subversion repo (we're in another country), and I've become convinced that distributed repos like git are the way to go. But the fact that large binaries require me to turn off gc.auto and otherwise avoid packing large filles makes my case a harder sell; quite a bit of the mother company's workflow has been integrated with Subversion. Note that in my case "large binary" is really a 164Mb file, but my work system is a dual-core 512Mb computer, so I suppose my hardware is really the limitation; still, some of the computers at the mother company are even lousier. If you'd prefer someone else to hack it, can you at least give me some pointers on which code files to start looking? I'd really like to have proper large-file-packing support, where large file is anything much bigger than a megabyte or so. Admittedly I'm not a filesystems guy and I can just barely grok git's blobs (they're the actual files, right? except they're named with their hash), but not packs (err, a bunch of files?) and trees (brown and green stuff you plant?). Still, I can try to learn it. Sincerely, AmkG ^ permalink raw reply [flat|nested] 31+ messages in thread
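The auto-gc workaround mentioned above is a single setting per
repository (0 disables automatic repacking entirely):

git config gc.auto 0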
* Re: Problem with large files on different OSes 2009-05-27 23:09 ` Alan Manuel Gloria @ 2009-05-28 1:56 ` Linus Torvalds 2009-05-28 3:26 ` Nicolas Pitre 0 siblings, 1 reply; 31+ messages in thread From: Linus Torvalds @ 2009-05-28 1:56 UTC (permalink / raw) To: Alan Manuel Gloria Cc: Jeff King, Nicolas Pitre, Jakub Narebski, Christopher Jefferson, git On Thu, 28 May 2009, Alan Manuel Gloria wrote: > > If you'd prefer someone else to hack it, can you at least give me some > pointers on which code files to start looking? I'd really like to > have proper large-file-packing support, where large file is anything > much bigger than a megabyte or so. > > Admittedly I'm not a filesystems guy and I can just barely grok git's > blobs (they're the actual files, right? except they're named with > their hash), but not packs (err, a bunch of files?) and trees (brown > and green stuff you plant?). Still, I can try to learn it. The packs is a big part of the complexity. If you were to keep the big files as unpacked blobs, that would be fairly simple - but the pack-file format is needed for fetching and pushing things, so it's not really an option. For your particular case, the simplest approach is probably to just limit the delta search. Something like just saying "if the object is larger than X, don't even bother to try to delta it, and just pack it without delta compression". The code would still load that whole object in one go, but it sounds like you can handle _one_ object at a time. So for your case, I don't think you need a fundamental git change - you'd be ok with just an inefficient pack format for large files that are very expensive to pack otherwise. You can already do that by using .gitattributes to not delta entries by name, but maybe it's worth doing explicitly by size too. I realize that the "delta" attribute is apparently almost totally undocumented. But if your big blobs have a particular name pattern, what you should try is to do something like - in your '.gitattributes' file (or .git/info/attributes if you don't want to check it in), add a line like *.img !delta which now sets the 'delta' attribute to false for all objects that match the '*.img' pattern. - see if pack creation is now acceptable (ie do a "git gc" or try to push somewhere) Something like the following may also work, as a more generic "just don't even bother trying to delta huge files". Totally untested. Maybe it works. Maybe it doesn't. Linus --- Documentation/config.txt | 7 +++++++ builtin-pack-objects.c | 9 +++++++++ 2 files changed, 16 insertions(+), 0 deletions(-) diff --git a/Documentation/config.txt b/Documentation/config.txt index 2c03162..8c21027 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -1238,6 +1238,13 @@ older version of git. If the `{asterisk}.pack` file is smaller than 2 GB, howeve you can use linkgit:git-index-pack[1] on the *.pack file to regenerate the `{asterisk}.idx` file. +pack.packDeltaLimit:: + The default maximum size of objects that we try to delta. ++ +Big files can be very expensive to delta, and if they are large binary +blobs, there is likely little upside to it anyway. So just pack them +as-is, and don't waste time on them. + pack.packSizeLimit:: The default maximum size of a pack. This setting only affects packing to a file, i.e. the git:// protocol is unaffected. 
It diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c index 9742b45..9a0072b 100644 --- a/builtin-pack-objects.c +++ b/builtin-pack-objects.c @@ -85,6 +85,7 @@ static struct progress *progress_state; static int pack_compression_level = Z_DEFAULT_COMPRESSION; static int pack_compression_seen; +static unsigned long pack_delta_limit = 64*1024*1024; static unsigned long delta_cache_size = 0; static unsigned long max_delta_cache_size = 0; static unsigned long cache_max_small_delta_size = 1000; @@ -1270,6 +1271,10 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, if (trg_entry->type != src_entry->type) return -1; + /* If we limit delta generation, don't even bother for larger blobs */ + if (pack_delta_limit && trg_entry->size >= pack_delta_limit) + return -1; + /* * We do not bother to try a delta that we discarded * on an earlier try, but only when reusing delta data. @@ -1865,6 +1870,10 @@ static int git_pack_config(const char *k, const char *v, void *cb) pack_size_limit_cfg = git_config_ulong(k, v); return 0; } + if (!strcmp(k, "pack.packdeltalimit")) { + pack_delta_limit = git_config_ulong(k, v); + return 0; + } return git_default_config(k, v, cb); } ^ permalink raw reply related [flat|nested] 31+ messages in thread
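As a paste-able form of the attributes route described above (note
that gitattributes spells "set this attribute to false" with a leading
'-', and the '*.img' pattern is only an example):

echo '*.img -delta' >> .gitattributes    # or >> .git/info/attributes to keep it local
git gc                                   # repack; matching blobs skip the delta search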
* Re: Problem with large files on different OSes 2009-05-28 1:56 ` Linus Torvalds @ 2009-05-28 3:26 ` Nicolas Pitre 2009-05-28 4:21 ` Eric Raible 0 siblings, 1 reply; 31+ messages in thread From: Nicolas Pitre @ 2009-05-28 3:26 UTC (permalink / raw) To: Linus Torvalds Cc: Alan Manuel Gloria, Jeff King, Jakub Narebski, Christopher Jefferson, git On Wed, 27 May 2009, Linus Torvalds wrote: > Something like the following may also work, as a more generic "just don't > even bother trying to delta huge files". > > Totally untested. Maybe it works. Maybe it doesn't. > > Linus > > --- > Documentation/config.txt | 7 +++++++ > builtin-pack-objects.c | 9 +++++++++ > 2 files changed, 16 insertions(+), 0 deletions(-) > > diff --git a/Documentation/config.txt b/Documentation/config.txt > index 2c03162..8c21027 100644 > --- a/Documentation/config.txt > +++ b/Documentation/config.txt > @@ -1238,6 +1238,13 @@ older version of git. If the `{asterisk}.pack` file is smaller than 2 GB, howeve > you can use linkgit:git-index-pack[1] on the *.pack file to regenerate > the `{asterisk}.idx` file. > > +pack.packDeltaLimit:: > + The default maximum size of objects that we try to delta. The option name feels a bit wrong here, like if it meant the max number of deltas in a pack. Nothing better comes to my mind at the moment though. > diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c > index 9742b45..9a0072b 100644 > --- a/builtin-pack-objects.c > +++ b/builtin-pack-objects.c > @@ -85,6 +85,7 @@ static struct progress *progress_state; > static int pack_compression_level = Z_DEFAULT_COMPRESSION; > static int pack_compression_seen; > > +static unsigned long pack_delta_limit = 64*1024*1024; > static unsigned long delta_cache_size = 0; > static unsigned long max_delta_cache_size = 0; > static unsigned long cache_max_small_delta_size = 1000; > @@ -1270,6 +1271,10 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, > if (trg_entry->type != src_entry->type) > return -1; > > + /* If we limit delta generation, don't even bother for larger blobs */ > + if (pack_delta_limit && trg_entry->size >= pack_delta_limit) > + return -1; I'd suggest filtering delta candidates out of delta_list up front in prepare_pack() instead. Nicolas ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-28 3:26 ` Nicolas Pitre @ 2009-05-28 4:21 ` Eric Raible 2009-05-28 4:30 ` Shawn O. Pearce 2009-05-28 17:41 ` Nicolas Pitre 0 siblings, 2 replies; 31+ messages in thread From: Eric Raible @ 2009-05-28 4:21 UTC (permalink / raw) To: git Nicolas Pitre <nico <at> cam.org> writes: > On Wed, 27 May 2009, Linus Torvalds wrote: > > > +pack.packDeltaLimit:: > > + The default maximum size of objects that we try to delta. > > The option name feels a bit wrong here, like if it meant the max number > of deltas in a pack. Nothing better comes to my mind at the moment > though. pack.maxDeltaSize sounds weird when said aloud. How about pack.deltaMaxSize? - Eric ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-28 4:21 ` Eric Raible @ 2009-05-28 4:30 ` Shawn O. Pearce 2009-05-28 5:52 ` Eric Raible 2009-05-28 17:41 ` Nicolas Pitre 1 sibling, 1 reply; 31+ messages in thread From: Shawn O. Pearce @ 2009-05-28 4:30 UTC (permalink / raw) To: Eric Raible; +Cc: git Eric Raible <raible@gmail.com> wrote: > Nicolas Pitre <nico <at> cam.org> writes: > > On Wed, 27 May 2009, Linus Torvalds wrote: > > > > > +pack.packDeltaLimit:: > > > + The default maximum size of objects that we try to delta. > > > > The option name feels a bit wrong here, like if it meant the max number > > of deltas in a pack. Nothing better comes to my mind at the moment > > though. > > pack.maxDeltaSize sounds weird when said aloud. > How about pack.deltaMaxSize? That sounds like, how big should a delta be? E.g. set it to 200 and any delta instruction stream over 200 bytes would be discarded, causing the whole object to be stored instead. Which is obviously somewhat silly, but that's the way I'd read that option... -- Shawn. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-28 4:30 ` Shawn O. Pearce @ 2009-05-28 5:52 ` Eric Raible 2009-05-28 8:52 ` Andreas Ericsson 0 siblings, 1 reply; 31+ messages in thread From: Eric Raible @ 2009-05-28 5:52 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: git On Wed, May 27, 2009 at 9:30 PM, Shawn O. Pearce <spearce@spearce.org> wrote: > Eric Raible <raible@gmail.com> wrote: >> Nicolas Pitre <nico <at> cam.org> writes: >> > On Wed, 27 May 2009, Linus Torvalds wrote: >> > >> > > +pack.packDeltaLimit:: >> > > + The default maximum size of objects that we try to delta. >> > >> > The option name feels a bit wrong here, like if it meant the max number >> > of deltas in a pack. Nothing better comes to my mind at the moment >> > though. >> >> pack.maxDeltaSize sounds weird when said aloud. >> How about pack.deltaMaxSize? > > That sounds like, how big should a delta be? E.g. set it to 200 > and any delta instruction stream over 200 bytes would be discarded, > causing the whole object to be stored instead. Which is obviously > somewhat silly, but that's the way I'd read that option... > > -- > Shawn. You're right, that _is_ a strange color for the bike shed... ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-28 5:52 ` Eric Raible @ 2009-05-28 8:52 ` Andreas Ericsson 0 siblings, 0 replies; 31+ messages in thread From: Andreas Ericsson @ 2009-05-28 8:52 UTC (permalink / raw) To: Eric Raible; +Cc: Shawn O. Pearce, git Eric Raible wrote: > On Wed, May 27, 2009 at 9:30 PM, Shawn O. Pearce <spearce@spearce.org> wrote: >> Eric Raible <raible@gmail.com> wrote: >>> Nicolas Pitre <nico <at> cam.org> writes: >>>> On Wed, 27 May 2009, Linus Torvalds wrote: >>>> >>>>> +pack.packDeltaLimit:: >>>>> + The default maximum size of objects that we try to delta. >>>> The option name feels a bit wrong here, like if it meant the max number >>>> of deltas in a pack. Nothing better comes to my mind at the moment >>>> though. >>> pack.maxDeltaSize sounds weird when said aloud. >>> How about pack.deltaMaxSize? >> That sounds like, how big should a delta be? E.g. set it to 200 >> and any delta instruction stream over 200 bytes would be discarded, >> causing the whole object to be stored instead. Which is obviously >> somewhat silly, but that's the way I'd read that option... >> >> -- >> Shawn. > > You're right, that _is_ a strange color for the bike shed... Since 'delta' names both the action and the result of the action, it's tricky to get it unambiguous without helping the grammar along a little. pack.maxFileSizeToDelta is probably the shortest we're going to get it while avoiding ambiguity. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 Register now for Nordic Meet on Nagios, June 3-4 in Stockholm http://nordicmeetonnagios.op5.org/ Considering the successes of the wars on alcohol, poverty, drugs and terror, I think we should give some serious thought to declaring war on peace. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-28 4:21 ` Eric Raible 2009-05-28 4:30 ` Shawn O. Pearce @ 2009-05-28 17:41 ` Nicolas Pitre 1 sibling, 0 replies; 31+ messages in thread From: Nicolas Pitre @ 2009-05-28 17:41 UTC (permalink / raw) To: Eric Raible; +Cc: git [ please don't drop original sender address and CC unless asked to ] On Thu, 28 May 2009, Eric Raible wrote: > Nicolas Pitre <nico <at> cam.org> writes: > > > On Wed, 27 May 2009, Linus Torvalds wrote: > > > > > +pack.packDeltaLimit:: > > > + The default maximum size of objects that we try to delta. > > > > The option name feels a bit wrong here, like if it meant the max number > > of deltas in a pack. Nothing better comes to my mind at the moment > > though. > > pack.maxDeltaSize sounds weird when said aloud. > How about pack.deltaMaxSize? pack.MaxSizeForDelta Whatever... Nicolas ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Problem with large files on different OSes 2009-05-27 22:07 ` Linus Torvalds 2009-05-27 23:09 ` Alan Manuel Gloria @ 2009-05-28 19:43 ` Jeff King 2009-05-28 19:49 ` Linus Torvalds 1 sibling, 1 reply; 31+ messages in thread From: Jeff King @ 2009-05-28 19:43 UTC (permalink / raw) To: Linus Torvalds; +Cc: Nicolas Pitre, Jakub Narebski, Christopher Jefferson, git On Wed, May 27, 2009 at 03:07:49PM -0700, Linus Torvalds wrote: > I suspect you wouldn't even need to. A regular delta algorithm would just > work fairly well to find the common parts. > > Sure, if the offset of the data changes a lot, then you'd miss all the > deltas between two (large) objects that now have data that traverses > object boundaries, but especially if the split size is pretty large (ie > several tens of MB, possibly something like 256M), that's still going to > be a pretty rare event. I confess that I'm not just interested in the _size_ of the deltas, but also speeding up deltification and rename detection. And I'm interested in files where we can benefit from their semantics a bit. So yes, with some overlap you would end up with pretty reasonable deltas for arbitrary binary, as you describe. But I was thinking something more like splitting a JPEG into a small first chunk that contains EXIF data, and a big secondary chunk that contains the actual image data. The second half is marked as not compressible (since it is already lossily compressed), and not interesting for deltification. When we consider two images for deltification, either: 1. they have the same "uninteresting" big part. In that case, you can trivially make a delta by just replacing the smaller first part (or even finding the optimal delta between the small parts). You never even need to look at the second half. 2. they don't have the same uninteresting part. You can reject them as delta candidates, because there is little chance the big parts will be related, even for a different version of the same image. And that extends to rename detection, as well. You can avoid looking at the big part at all if you assume big parts with differing hashes are going to be drastically different. > > That would make things like rename detection very fast. Of course it has > > the downside that you are cementing whatever split you made into history > > for all time. And it means that two people adding the same content might > > end up with different trees. Both things that git tries to avoid. > > It's the "I can no longer see that the files are the same by comparing > SHA1's" that I personally dislike. Right. I don't think splitting in the git data structure itself is worth it for that reason. But deltification and rename detection keeping a cache of smart splits that says "You can represent <sha-1> as this concatenation of <sha-1>s" means they can still get some advantage (over multiple runs, certainly, but possibly even over a single run: a smart splitter might not even have to look at the entire file contents). > So my "fixed chunk" approach would be nice in that if you have this kind > of "chunkblob" entry, in the tree (and index) it would literally be one > entry, and look like that: > > 100644 chunkblob <sha1> But if I am understanding you correctly, you _are_ proposing to munge the git data structure here. Which means that pre-chunkblob trees will point to the raw blob, and then post-chunkblob trees will point to the chunked representation. And that means not being able to use the sha-1 to see that they eventually point to the same content. 
-Peff
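The two-case heuristic described above can be sketched roughly as follows. This is purely illustrative and not anything git does today: the 64 kB split point, the helper names, and the use of plain SHA-1 over raw bytes are all assumptions.

    import hashlib

    METADATA_BYTES = 64 * 1024  # assumed size of the small "metadata" head


    def head_and_body_digests(path, chunk_size=8 * 1024 * 1024):
        """Hash the small metadata head and the big image body separately."""
        head, body = hashlib.sha1(), hashlib.sha1()
        with open(path, "rb") as f:
            head.update(f.read(METADATA_BYTES))
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                body.update(chunk)
        return head.hexdigest(), body.hexdigest()


    def worth_deltifying(path_a, path_b):
        """Case 1: identical big parts, so a delta only has to cover the heads.
        Case 2: differing big parts, so reject the pair without further work."""
        return head_and_body_digests(path_a)[1] == head_and_body_digests(path_b)[1]

The point of the sketch is only that case 2 lets a delta or rename search discard huge candidate pairs after comparing two cached digests, without rereading the bulk data.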
* Re: Problem with large files on different OSes
  2009-05-28 19:43 ` Jeff King
@ 2009-05-28 19:49   ` Linus Torvalds
  0 siblings, 0 replies; 31+ messages in thread
From: Linus Torvalds @ 2009-05-28 19:49 UTC (permalink / raw)
  To: Jeff King; +Cc: Nicolas Pitre, Jakub Narebski, Christopher Jefferson, git

On Thu, 28 May 2009, Jeff King wrote:
>
> > So my "fixed chunk" approach would be nice in that if you have this kind
> > of "chunkblob" entry, in the tree (and index) it would literally be one
> > entry, and look like that:
> >
> >   100644 chunkblob <sha1>
>
> But if I am understanding you correctly, you _are_ proposing to munge
> the git data structure here. Which means that pre-chunkblob trees will
> point to the raw blob, and then post-chunkblob trees will point to the
> chunked representation. And that means not being able to use the sha-1
> to see that they eventually point to the same content.

Yes. If we were to do this, and people have large chunks, then once you
start using the chunkblob (for lack of a better word) model, you'll see
the same object with two different SHA1's. But it's a one-time (and
one-way - since once it's a chunkblob, older models can't touch it)
thing, it can never cause any long-term confusion.

(We'll end up with something similar if somebody ever breaks SHA-1
enough for us to care - the logical way to handle it is likely to just
accept the SHA512-160 object name "aliases")

		Linus
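For illustration only, a toy version of the fixed-chunk "chunkblob" idea: the single tree entry would point at a manifest listing the ids of fixed-size chunks. No such object type exists in git; the 256 MB chunk size and the plain SHA-1 over raw chunk bytes are assumptions taken from the discussion.

    import hashlib

    CHUNK_SIZE = 256 * 1024 * 1024  # the "pretty large" fixed split size mentioned above


    def chunkblob_manifest(path):
        """Return the list of chunk ids a hypothetical chunkblob would record.

        Each fixed-size chunk is hashed on its own, so an edit near the end
        of a huge file leaves the ids of earlier chunks unchanged, while the
        whole-file SHA-1 is never computed at all. That is exactly why old
        and new trees would end up naming the same content differently.
        """
        ids = []
        with open(path, "rb") as f:
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                ids.append(hashlib.sha1(chunk).hexdigest())
        return ids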
* Re: Problem with large files on different OSes
  2009-05-27 21:53 ` Jeff King
  2009-05-27 22:07   ` Linus Torvalds
@ 2009-05-27 23:29   ` Nicolas Pitre
  2009-05-28 20:00     ` Jeff King
  1 sibling, 1 reply; 31+ messages in thread
From: Nicolas Pitre @ 2009-05-27 23:29 UTC (permalink / raw)
  To: Jeff King; +Cc: Linus Torvalds, Jakub Narebski, Christopher Jefferson, git

On Wed, 27 May 2009, Jeff King wrote:

> On Wed, May 27, 2009 at 01:37:26PM -0400, Nicolas Pitre wrote:
>
> > My idea for handling big files is simply to:
> >
> > 1) Define a new parameter to determine what is considered a big file.
> >
> > 2) Store any file larger than the threshold defined in (1) directly into
> >    a pack of their own at "git add" time.
> >
> > 3) Never attempt to diff nor delta large objects, again according to
> >    (1) above. It is typical for large files not to be deltifiable, and
> >    a diff for files in the thousands of megabytes cannot possibly be
> >    sane.
>
> What about large files that have a short metadata section that may
> change? Versions with only the metadata changed delta well, and with a
> custom diff driver, can produce useful diffs. And I don't think that is
> an impractical or unlikely example; large files can often be tagged
> media.

Sure... but what is the actual data pattern currently used out there?
What does P4 or CVS or SVN do with multiple versions of almost
identical 2GB+ files?

My point is, if the tool people are already using with gigantic
repositories is not bothering with delta compression then we don't lose
much in making git usable with those repositories by doing the same.
And this can be achieved pretty easily with fairly minor changes. Plus,
my proposal doesn't introduce any incompatibility in the git repository
format while not denying possible future enhancements.

For example, it would be non-trivial but still doable to make git work
on data streams instead of buffers. The current code for blob
read/write/delta could be kept for performance, along with another
version in parallel doing the same but with file descriptors and
pread/pwrite for big files.

> Linus' "split into multiple objects" approach means you could perhaps
> split intelligently into metadata and "uninteresting data" sections
> based on the file type. That would make things like rename detection
> very fast. Of course it has the downside that you are cementing whatever
> split you made into history for all time. And it means that two people
> adding the same content might end up with different trees. Both things
> that git tries to avoid.

Exactly. And honestly, I don't think it would be worth trying to do
inexact rename detection for huge files anyway. It is rarely the case
that moving/renaming a movie file needs to change its content in some
way.

> I wonder if it would be useful to make such a split at _read_ time. That
> is, still refer to the sha-1 of the whole content in the tree objects,
> but have a separate cache that says "hash X splits to the concatenation
> of Y,Z". Thus you can always refer to the "pure" object, both as a user,
> and in the code. So we could avoid retrofitting all of the code -- just
> some parts like diff might want to handle an object in multiple
> segments.

Unless there are real world scenarios where diffing (as we know it) two
huge files is a common and useful operation, I don't think we should
even try to consider that problem. What people are doing with huge
files is storing them and retrieving them, so we probably should only
limit ourselves to making those operations work for now.
And to that effect, I don't think it would be wise to introduce
artificial segmentations in the object structure that would make both
the code and the git model more complex. We could just as well limit
the complexity to the code for dealing with blobs without having to
load them all in memory at once, and keep the git repository model
simple.

So if we want to do the real thing and deal with huge blobs, there is
only a small set of operations that need to be considered:

- Creation of new blobs (or "git add") for huge files: can be done
  trivially in chunks. An open issue is whether the SHA1 of the file
  should be computed in a first pass over the file, followed by a
  second pass to deflate it if the object doesn't already exist, or
  whether the SHA1 summing and deflating should be done in a single
  pass, discarding the result if the object turns out to already
  exist. Still trivial to implement either way.

- Checkout of huge files: still trivial to perform if not deltified.
  In the delta case, it _could_ still be quite simple, by recursively
  parsing deltas, if the base objects were not deflated. But again, it
  remains to be seen whether 1) deflating or even 2) deltifying huge
  files is useful in practice with real-world data.

- repack/fetch/pull: In the pack data reuse case, the code is already
  fine as it streams small blocks from the source to the destination.
  Delta compression can be done by using coarse indexing of the source
  object and loading/discarding portions of the source data while the
  target object is processed in a streaming fashion.

Other than that, I don't see what else git could usefully do with huge
files. The above operations (read/write/delta of huge blobs) would need
to be done with a separate set of functions, and a configurable size
threshold would select the regular or the chunked set. Nothing
fundamentally difficult in my mind.


Nicolas
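The first operation above, creating a blob from a huge file without holding it all in memory, boils down to streaming SHA-1 over git's object header plus the contents. A minimal sketch (git itself does this in C and would also deflate and write the object in the same pass; the 8 MB chunk size is arbitrary):

    import hashlib
    import os


    def streamed_blob_id(path, chunk_size=8 * 1024 * 1024):
        """Compute a blob's object id in one streaming pass.

        Git hashes the header "blob <size>" plus a NUL byte, followed by the
        raw file contents, so the id of an arbitrarily large file needs only
        constant memory.
        """
        h = hashlib.sha1()
        h.update(b"blob %d\0" % os.path.getsize(path))
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                h.update(chunk)
        return h.hexdigest()

For files untouched by clean/CRLF filters the result matches what "git hash-object" prints, and when the id already exists in the object store the deflate-and-write step can be skipped, which is the two-pass option mentioned above.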
* Re: Problem with large files on different OSes
  2009-05-27 23:29 ` Nicolas Pitre
@ 2009-05-28 20:00   ` Jeff King
  2009-05-28 20:54     ` Nicolas Pitre
  0 siblings, 1 reply; 31+ messages in thread
From: Jeff King @ 2009-05-28 20:00 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Linus Torvalds, Jakub Narebski, Christopher Jefferson, git

On Wed, May 27, 2009 at 07:29:02PM -0400, Nicolas Pitre wrote:

> > What about large files that have a short metadata section that may
> > change? Versions with only the metadata changed delta well, and with a
> > custom diff driver, can produce useful diffs. And I don't think that is
> > an impractical or unlikely example; large files can often be tagged
> > media.
>
> Sure... but what is the actual data pattern currently used out there?

I'm not sure what you mean by "out there", but I just exactly described
the data pattern of a repo I have (a few thousand 5 megapixel JPEGs and
short (a few dozens of megabytes) AVIs, frequent additions, infrequently
changing photo contents, and moderately changing metadata). I don't know
how that matches other people's needs.

Game designers have been mentioned before in large media checkins, and
I think they focus less on metadata changes. Media is either there to
stay, or it is replaced as a whole.

> What does P4 or CVS or SVN do with multiple versions of almost
> identical 2GB+ files?

I only ever tried this with CVS, which just stored the entire binary
version as a whole. And of course running "diff" was useless, but then
it was also useless on text files. ;) I suspect CVS would simply choke
on a 2G file.

But I don't want to do as well as those other tools. I want to be able
to do all of the useful things git can do but with large files.

> My point is, if the tool people are already using with gigantic
> repositories is not bothering with delta compression then we don't lose
> much in making git usable with those repositories by doing the same.
> And this can be achieved pretty easily with fairly minor changes. Plus,
> my proposal doesn't introduce any incompatibility in the git repository
> format while not denying possible future enhancements.

Right. I think in some ways we are perhaps talking about two different
problems. I am really interested in moderately large files (a few
megabytes up to a few dozen or even a hundred megabytes), but I want
git to be _fast_ at dealing with them, and doing useful operations on
them (like rename detection, diffing, etc).

A smart splitter would probably want to mark part of the split as "this
section is large and uninteresting for compression, deltas, diffing,
and renames". And that half may be stored in the way that you are
proposing (in a separate single-object pack, no compression, no delta,
etc). So in a sense I think what I am talking about would build on top
of what you want to do.

> > very fast. Of course it has the downside that you are cementing whatever
> > split you made into history for all time. And it means that two people
> > adding the same content might end up with different trees. Both things
> > that git tries to avoid.
>
> Exactly. And honestly, I don't think it would be worth trying to do
> inexact rename detection for huge files anyway. It is rarely the case
> that moving/renaming a movie file needs to change its content in some
> way.

I should have been more clear in my other email: I think splitting that
is represented in the actual git trees is not going to be worth the
hassle. But I do think we can get some of the benefits by maintaining a
split cache for viewers.
And again, maybe my use case is crazy, but in my repo I have renames
and metadata content changes together.

> Unless there are real world scenarios where diffing (as we know it) two
> huge files is a common and useful operation, I don't think we should
> even try to consider that problem. What people are doing with huge
> files is storing them and retrieving them, so we probably should only
> limit ourselves to making those operations work for now.

Again, this is motivated by a real use case that I have.

> And to that effect, I don't think it would be wise to introduce
> artificial segmentations in the object structure that would make both
> the code and the git model more complex. We could just as well limit
> the complexity to the code for dealing with blobs without having to
> load them all in memory at once, and keep the git repository model
> simple.

I do agree with this; I don't want to make any changes to the
repository model.

> So if we want to do the real thing and deal with huge blobs, there is
> only a small set of operations that need to be considered:

I think everything you say here is sensible; I just want more
operations for my use case.

-Peff
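A rough sketch of the split cache Jeff describes ("hash X splits to the concatenation of Y,Z"). The JSON file, helper names, and similarity score are invented for illustration and are not an existing git mechanism; since trees still name the whole blob, such a cache would be purely local and could be rebuilt at any time.

    import json
    import os


    def load_split_cache(cache_path):
        """Map a whole-blob id to the list of part ids it splits into."""
        if os.path.exists(cache_path):
            with open(cache_path) as f:
                return json.load(f)
        return {}


    def record_split(cache, whole_id, part_ids):
        """Remember that <whole_id> is the concatenation of <part_ids>."""
        cache[whole_id] = list(part_ids)


    def save_split_cache(cache_path, cache):
        with open(cache_path, "w") as f:
            json.dump(cache, f)


    def cheap_similarity(cache, id_a, id_b):
        """Estimate similarity from the cached part lists alone, so rename
        detection never has to reread the big "uninteresting" halves.
        Returns None if either blob has no cached split, in which case the
        caller falls back to the normal, expensive comparison."""
        parts_a, parts_b = cache.get(id_a), cache.get(id_b)
        if parts_a is None or parts_b is None:
            return None
        shared = len(set(parts_a) & set(parts_b))
        return shared / max(len(parts_a), len(parts_b))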
* Re: Problem with large files on different OSes
  2009-05-28 20:00 ` Jeff King
@ 2009-05-28 20:54   ` Nicolas Pitre
  2009-05-28 21:21     ` Jeff King
  0 siblings, 1 reply; 31+ messages in thread
From: Nicolas Pitre @ 2009-05-28 20:54 UTC (permalink / raw)
  To: Jeff King; +Cc: Linus Torvalds, Jakub Narebski, Christopher Jefferson, git

On Thu, 28 May 2009, Jeff King wrote:

> On Wed, May 27, 2009 at 07:29:02PM -0400, Nicolas Pitre wrote:
>
> > > What about large files that have a short metadata section that may
> > > change? Versions with only the metadata changed delta well, and with a
> > > custom diff driver, can produce useful diffs. And I don't think that is
> > > an impractical or unlikely example; large files can often be tagged
> > > media.
> >
> > Sure... but what is the actual data pattern currently used out there?
>
> I'm not sure what you mean by "out there", but I just exactly described
> the data pattern of a repo I have (a few thousand 5 megapixel JPEGs and
> short (a few dozens of megabytes) AVIs, frequent additions, infrequently
> changing photo contents, and moderately changing metadata). I don't know
> how that matches other people's needs.

How does diffing JPEGs or AVIs à la 'git diff' make sense?

Also, you certainly have little to delta against, as you add new photos
more often than you modify existing ones?

> Game designers have been mentioned before in large media checkins, and
> I think they focus less on metadata changes. Media is either there to
> stay, or it is replaced as a whole.

Right. And my proposal fits that scenario pretty well.

> > What does P4 or CVS or SVN do with multiple versions of almost
> > identical 2GB+ files?
>
> I only ever tried this with CVS, which just stored the entire binary
> version as a whole. And of course running "diff" was useless, but then
> it was also useless on text files. ;) I suspect CVS would simply choke
> on a 2G file.
>
> But I don't want to do as well as those other tools. I want to be able
> to do all of the useful things git can do but with large files.

Right now git simply does much worse, so doing as well is still a
worthy goal.

> Right. I think in some ways we are perhaps talking about two different
> problems. I am really interested in moderately large files (a few
> megabytes up to a few dozen or even a hundred megabytes), but I want
> git to be _fast_ at dealing with them, and doing useful operations on
> them (like rename detection, diffing, etc).

I still can't see how diffing big files is useful. Certainly you'll
need a specialized external diff tool, in which case it is not git's
problem anymore, except for writing content to temporary files.

Rename detection: either you deal with the big files each time, or you
(re)create a cache with that information so no analysis is needed the
second time around. This is something that even small files might
possibly benefit from. But in any case, there is no other way but to
bite the bullet at least initially, and big files will be slower to
process no matter what.

> A smart splitter would probably want to mark part of the split as "this
> section is large and uninteresting for compression, deltas, diffing,
> and renames". And that half may be stored in the way that you are
> proposing (in a separate single-object pack, no compression, no delta,
> etc). So in a sense I think what I am talking about would build on top
> of what you want to do.
It looks to me like you wish for git to do something that a specialized
database would be much better suited for. Aren't there already tools to
gather picture metadata, just like iTunes does with MP3s?

> But I do think we can get some of the benefits by maintaining a split
> cache for viewers.

Sure. But being able to deal with large (1GB and more) files remains a
totally different problem.


Nicolas
* Re: Problem with large files on different OSes
  2009-05-28 20:54 ` Nicolas Pitre
@ 2009-05-28 21:21   ` Jeff King
  0 siblings, 0 replies; 31+ messages in thread
From: Jeff King @ 2009-05-28 21:21 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Linus Torvalds, Jakub Narebski, Christopher Jefferson, git

On Thu, May 28, 2009 at 04:54:28PM -0400, Nicolas Pitre wrote:

> > I'm not sure what you mean by "out there", but I just exactly described
> > the data pattern of a repo I have (a few thousand 5 megapixel JPEGs and
> > short (a few dozens of megabytes) AVIs, frequent additions, infrequently
> > changing photo contents, and moderately changing metadata). I don't know
> > how that matches other people's needs.
>
> How does diffing JPEGs or AVIs à la 'git diff' make sense?

It is useful to see the changes in a text representation of the
metadata, with a single-line mention if the image or movie data has
changed. It's why I wrote the textconv feature.

> Also, you certainly have little to delta against, as you add new photos
> more often than you modify existing ones?

I do add new photos more often than I modify existing ones. But I do
modify the old ones (tag corrections, new tags I didn't think of
initially, updates to the tagging schema, etc), too.

The sum of the sizes for all objects in the repo is 8.3G. The fully
packed repo is 3.3G. So there clearly is some benefit from deltas, and
I don't want to just turn them off. Doing the actual repack is
painfully slow.

> I still can't see how diffing big files is useful. Certainly you'll
> need a specialized external diff tool, in which case it is not git's
> problem anymore, except for writing content to temporary files.

Writing content to temporary files is actually quite slow when the
files are hundreds of megabytes (even "git show" can be painful, let
alone "git log -p"). But that is something that can be dealt with by
improving the interface to external diff and textconv to avoid writing
out the whole file (and is something I have patches in the works for,
but they need to be finished and cleaned up).

> Rename detection: either you deal with the big files each time, or you
> (re)create a cache with that information so no analysis is needed the
> second time around. This is something that even small files might
> possibly benefit from. But in any case, there is no other way but to
> bite the bullet at least initially, and big files will be slower to
> process no matter what.

Right. What I am proposing is basically to create such a cache. But it
is one that is general enough that it could be used for more than just
the rename detection (though arguably rename detection and
deltification could actually share more of the same techniques, in
which case a cache for one would help the other).

> It looks to me like you wish for git to do something that a specialized
> database would be much better suited for. Aren't there already tools to
> gather picture metadata, just like iTunes does with MP3s?

Yes, I already have tools for handling picture metadata. How do I
version control that information? How do I keep it in sync across
multiple checkouts? How do I handle merging concurrent changes from
multiple sources? How do I keep that metadata connected to the pictures
that it describes?

The things I want to do are conceptually no different from what I do
with other files; it's merely the size of the files that makes working
with them in git less convenient (but it does _work_; I am using git
for this _now_, and I have been for a few years).
> But being able to deal with large (1GB and more) files remains a
> totally different problem.

Right, that is why I think I will end up building on top of what you
do. I am trying to make a way for some operations to avoid looking at
the entire file, even streaming it, which should drastically speed up
those operations. But it is unavoidable that some operations (e.g.,
"git add") will have to look at the entire file. And that is what your
proposal is about; streaming is basically the only way forward there.

-Peff
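The textconv mechanism referred to above is a real git feature: a diff driver's textconv command converts a blob to text before diffing, so "git diff" and "git log -p" show metadata changes as ordinary text instead of binary noise. A sketch of such a filter for JPEG metadata, assuming the Pillow package is available; the "jpegmeta" driver name and the script path are made up for the example.

    #!/usr/bin/env python3
    """Illustrative textconv filter: print a JPEG's EXIF tags as text.

    Wiring (standard textconv configuration; the driver name is an assumption):
        echo '*.jpg diff=jpegmeta' >> .gitattributes
        git config diff.jpegmeta.textconv ./jpeg-textconv.py
    """
    import sys

    from PIL import ExifTags, Image  # assumes Pillow is installed

    # git invokes the textconv command with a single argument: the path of
    # the file to convert. Whatever is printed becomes the diff content.
    image = Image.open(sys.argv[1])
    for tag_id, value in sorted(image.getexif().items()):
        name = ExifTags.TAGS.get(tag_id, hex(tag_id))
        print(f"{name}: {value}")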
end of thread, other threads:[~2009-05-28 21:21 UTC | newest]

Thread overview: 31+ messages
2009-05-27 10:52 Problem with large files on different OSes Christopher Jefferson
2009-05-27 11:37 ` Andreas Ericsson
2009-05-27 13:02 ` Christopher Jefferson
2009-05-27 13:28 ` John Tapsell
2009-05-27 13:30 ` Christopher Jefferson
2009-05-27 13:32 ` John Tapsell
2009-05-27 14:01 ` Tomas Carnecky
2009-05-27 14:09 ` Christopher Jefferson
2009-05-27 14:22 ` Andreas Ericsson
2009-05-27 14:37 ` Jakub Narebski
2009-05-27 16:30 ` Linus Torvalds
2009-05-27 16:59 ` Linus Torvalds
2009-05-27 17:22 ` Christopher Jefferson
2009-05-27 17:30 ` Jakub Narebski
2009-05-27 17:37 ` Nicolas Pitre
2009-05-27 21:53 ` Jeff King
2009-05-27 22:07 ` Linus Torvalds
2009-05-27 23:09 ` Alan Manuel Gloria
2009-05-28  1:56 ` Linus Torvalds
2009-05-28  3:26 ` Nicolas Pitre
2009-05-28  4:21 ` Eric Raible
2009-05-28  4:30 ` Shawn O. Pearce
2009-05-28  5:52 ` Eric Raible
2009-05-28  8:52 ` Andreas Ericsson
2009-05-28 17:41 ` Nicolas Pitre
2009-05-28 19:43 ` Jeff King
2009-05-28 19:49 ` Linus Torvalds
2009-05-27 23:29 ` Nicolas Pitre
2009-05-28 20:00 ` Jeff King
2009-05-28 20:54 ` Nicolas Pitre
2009-05-28 21:21 ` Jeff King