* Cygwin can't handle huge packfiles?
From: Kees-Jan Dijkzeul @ 2006-04-03 9:46 UTC
To: git

Hi,

I'm trying to get Git to manage a 5 GB source tree. Under linux, this
works like a charm. Under cygwin, however, I run into difficulties.
For example:

$ git-clone sgp-wa/ sgp-wa.clone
fatal: packfile ./objects/pack/pack-56aa013a0234e198467ed37ae5db925764a6ee98.pack cannot be mapped.
fatal: unexpected EOF
fetch-pack from '/cygdrive/e/Projects/sgp-wa/.git' failed.

To figure out what is happening, I printed the value of errno, which
turns out to be 12 (Cannot allocate memory). I'm not sure how mmap is
implemented in cygwin, but if it allocates memory and loads the file
into it, then this error is not surprising, as the pack file in
question is 1.5 GB in size.

I'm not sure how to approach this problem. Any tips would be greatly
appreciated.

Thanks a lot!

Kees-Jan

* Re: Cygwin can't handle huge packfiles?
From: Johannes Schindelin @ 2006-04-03 13:23 UTC
To: Kees-Jan Dijkzeul; +Cc: git

Hi,

On Mon, 3 Apr 2006, Kees-Jan Dijkzeul wrote:

> I'm trying to get Git to manage a 5 GB source tree. Under linux, this
> works like a charm. Under cygwin, however, I run into difficulties.
> For example:
>
> $ git-clone sgp-wa/ sgp-wa.clone
> fatal: packfile
> ./objects/pack/pack-56aa013a0234e198467ed37ae5db925764a6ee98.pack
> cannot be mapped.
> fatal: unexpected EOF
> fetch-pack from '/cygdrive/e/Projects/sgp-wa/.git' failed.
>
> To figure out what is happening, I printed the value of errno, which
> turns out to be 12 (Cannot allocate memory). I'm not sure how mmap is
> implemented in cygwin, but if they allocate memory and load the file
> into it, then this error is not surprising, as the pack file in
> question is 1.5 GB in size.

The problem is not mmap() on cygwin, but that a fork() has to jump through
hoops to reinstall the open file descriptors on cygwin. If the
corresponding file was deleted, that fails. Therefore, we work around that
on cygwin by actually reading the file into memory, *not* mmap()ing it.

Hth,
Dscho
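
The workaround described above (read the file into memory instead of
mapping it) can be sketched roughly as follows. This is only a minimal
illustration under stated assumptions, not git's actual compat code: the
function name read_instead_of_mmap is hypothetical, it returns NULL on
failure where a real mmap() replacement would return MAP_FAILED, and it
moves the file position of fd as a side effect. It does show why a 1.5 GB
pack needs 1.5 GB of heap with this approach.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for an mmap() replacement: copy `length` bytes
 * starting at `offset` of `fd` into freshly allocated, zero-filled memory.
 * The caller releases the result with free().
 */
void *read_instead_of_mmap(int fd, size_t length, off_t offset)
{
	size_t done = 0;
	char *buf;

	assert(fd >= 0 && offset >= 0);
	if (lseek(fd, offset, SEEK_SET) < 0)
		return NULL;

	/* The whole range must fit in the heap at once. */
	buf = calloc(1, length ? length : 1);
	if (!buf) {
		errno = ENOMEM;	/* errno 12, as in the original report */
		return NULL;
	}

	while (done < length) {
		ssize_t n = read(fd, buf + done, length - done);
		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, try again */
			free(buf);
			return NULL;
		}
		if (n == 0)
			break;	/* file shorter than requested; tail stays zeroed */
		done += (size_t)n;
	}
	return buf;
}
```

Unlike a real mapping, nothing here is paged in lazily: every byte is
read up front, even if the caller only ever touches a small part of it.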

* Re: Cygwin can't handle huge packfiles?
From: Morten Welinder @ 2006-04-03 14:26 UTC
To: Johannes Schindelin; +Cc: Kees-Jan Dijkzeul, git

> The problem is not mmap() on cygwin, but that a fork() has to jump through
> hoops to reinstall the open file descriptors on cygwin. If the
> corresponding file was deleted, that fails. Therefore, we work around that
> on cygwin by actually reading the file into memory, *not* mmap()ing it.

Maybe, but you aren't going to be able to handle much bigger packs
even on *nix. Unless you go 64-bit, that is.

M.

* Re: Cygwin can't handle huge packfiles?
From: Linus Torvalds @ 2006-04-03 14:33 UTC
To: Johannes Schindelin; +Cc: Kees-Jan Dijkzeul, git

On Mon, 3 Apr 2006, Johannes Schindelin wrote:
>
> The problem is not mmap() on cygwin, but that a fork() has to jump through
> hoops to reinstall the open file descriptors on cygwin. If the
> corresponding file was deleted, that fails. Therefore, we work around that
> on cygwin by actually reading the file into memory, *not* mmap()ing it.

Well, we could actually do a _real_ mmap on pack-files. The pack-files are
much better mmap'ed - there we don't _want_ them to be removed while we're
using them. It was the index file etc that was problematic.

Maybe the cygwin fake mmap should be triggered only for the index (and
possibly the individual objects - if only because there doing a
malloc+read may actually be faster).

Using malloc+read on pack-files is pretty wasteful, since we usually only
use a very small part of them (i.e. if we have a 1.5GB pack-file, it's sad
to read all of it, when we'd usually actually access just a small small
fraction of it).

That said, I think git _does_ have problems with large pack-files. We have
some 32-bit issues etc, and just virtual address space things. So for now,
it's probably best to limit pack-files to the few-hundred-meg size, and
create several smaller ones rather than one huge one.

		Linus

* Re: Cygwin can't handle huge packfiles?
From: Linus Torvalds @ 2006-04-03 14:36 UTC
To: Johannes Schindelin; +Cc: Kees-Jan Dijkzeul, git

On Mon, 3 Apr 2006, Linus Torvalds wrote:
>
> That said, I think git _does_ have problems with large pack-files. We have
> some 32-bit issues etc

I should clarify that. git _itself_ shouldn't have any 32-bit issues, but
the packfile data structure does. The index has 32-bit offsets into
individual pack-files.

That's not hugely fundamental, but I didn't expect people to hit it this
quickly. What kind of project has a 1.5GB pack-file _already_? I hope it's
fifteen years of history (so that we'll have another fifteen years before
we'll have to worry about 4GB pack-files ;)

		Linus
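
To make the 4GB ceiling concrete, here is a small sketch of reading a
32-bit pack offset out of an index entry. It assumes the entry layout
given later in this thread (a 4-byte network-byte-order offset followed
by a 20-byte object name); the names read_be32, idx_entry_offset and
offset_fits_in_idx are made up for this illustration and are not git
functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed entry size: 4-byte offset plus 20-byte object name. */
#define IDX_ENTRY_SIZE 24

/* Decode a 4-byte big-endian (network byte order) integer. */
uint32_t read_be32(const unsigned char *p)
{
	assert(p != NULL);
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Pack offset recorded in the n-th entry of an in-memory entry table. */
uint32_t idx_entry_offset(const unsigned char *entries, size_t n)
{
	return read_be32(entries + n * IDX_ENTRY_SIZE);
}

/* Non-zero if an offset into the pack can be stored in the 4-byte field. */
int offset_fits_in_idx(uint64_t pack_offset)
{
	return pack_offset <= UINT32_MAX;
}
```

An object starting 1.5 GB into a pack still fits in the field; one
starting at or beyond 4 GB (2^32 bytes) does not, which is the limit
referred to above.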

* Re: Cygwin can't handle huge packfiles?
From: Kees-Jan Dijkzeul @ 2006-04-05 13:24 UTC
To: git

On 4/3/06, Linus Torvalds <torvalds@osdl.org> wrote:
[...]
> That's not hugely fundamental, but I didn't expect people to hit it this
> quickly. What kind of project has a 1.5GB pack-file _already_? I hope it's
> fifteen years of history (so that we'll have another fifteen years before
> we'll have to worry about 4GB pack-files ;)

I'm trying to get Git to manage my company's source tree. We're
writing software for digital TV sets. Anyway, the archive is about 5 GB
in size and contains binaries, zip files, Excel sheets, meeting minutes
and whatnot. So it doesn't compress very well. The 1.5 GB pack file
hardly contains any history at all (five commits or so). On the flip
side, for now I'll be the only one adding to the archive, so at least
it will not grow that fast ;-)

Anyway, to reconstitute the tree, I need very nearly the entire pack,
so limiting the pack size won't do much good, as git will still try to
allocate a total of 1.5 GB of memory (which, unfortunately, isn't there
:-)

Inspired by a patch of Alex Riesen (thanks, Alex), I tried to use the
regular mmap for mapping pack files, only to discover that I compile
without defining "NO_MMAP", so I've been using the stock mmap all
along. So now I'm thinking that the cygwin mmap also does a
malloc-and-read, just like git does with NO_MMAP. So I'll continue to
investigate in that direction.

To be continued...

Groetjes,

Kees-Jan

* Re: Cygwin can't handle huge packfiles?
From: Johannes Schindelin @ 2006-04-05 14:14 UTC
To: Kees-Jan Dijkzeul; +Cc: git

Hi,

On Wed, 5 Apr 2006, Kees-Jan Dijkzeul wrote:

> On 4/3/06, Linus Torvalds <torvalds@osdl.org> wrote:
> [...]
> > That's not hugely fundamental, but I didn't expect people to hit it this
> > quickly. What kind of project has a 1.5GB pack-file _already_? I hope it's
> > fifteen years of history (so that we'll have another fifteen years before
> > we'll have to worry about 4GB pack-files ;)
>
> I'm trying to get Git to manage my company's source tree. We're
> writing software for digital TV sets. Anyway, the archive is about 5 GB
> in size and contains binaries, zip files, Excel sheets, meeting minutes
> and whatnot. So it doesn't compress very well. The 1.5 GB pack file
> hardly contains any history at all (five commits or so). On the flip
> side, for now I'll be the only one adding to the archive, so at least
> it will not grow that fast ;-)
>
> Anyway, to reconstitute the tree, I need very nearly the entire pack,
> so limiting the pack size won't do much good, as git will still try to
> allocate a total of 1.5 GB of memory (which, unfortunately, isn't there
> :-)
>
> Inspired by a patch of Alex Riesen (thanks, Alex), I tried to use the
> regular mmap for mapping pack files, only to discover that I compile
> without defining "NO_MMAP", so I've been using the stock mmap all
> along. So now I'm thinking that the cygwin mmap also does a
> malloc-and-read, just like git does with NO_MMAP. So I'll continue to
> investigate in that direction.

I think cygwin's mmap() is based on the Win32 API equivalent, which could
mean that it *is* memory mapped, but in a special area (which is smaller
than 1.5 gigabytes). In this case, it would make sense to limit the pack
size, thereby having several packs, and mmap() them as they are needed.

Hth,
Dscho

* Re: Cygwin can't handle huge packfiles?
From: Christopher Faylor @ 2006-04-05 21:08 UTC
To: Johannes Schindelin, Kees-Jan Dijkzeul, git

On Wed, Apr 05, 2006 at 04:14:20PM +0200, Johannes Schindelin wrote:
>> Inspired by a patch of Alex Riesen (thanks, Alex), I tried to use the
>> regular mmap for mapping pack files, only to discover that I compile
>> without defining "NO_MMAP", so I've been using the stock mmap all
>> along. So now I'm thinking that the cygwin mmap also does a
>> malloc-and-read, just like git does with NO_MMAP. So I'll continue to
>> investigate in that direction.
>
>I think cygwin's mmap() is based on the Win32 API equivalent, which could
>mean that it *is* memory mapped, but in a special area (which is smaller
>than 1.5 gigabyte). In this case, it would make sense to limit the pack
>size, thereby having several packs, and mmap() them as they are needed.

Yes, cygwin's mmap uses CreateFileMapping and MapViewOfFile. IIRC,
Windows might have a 2G limitation lurking under the hood somewhere but
I think that might be tweakable with some registry setting.

cgf

* Re: Cygwin can't handle huge packfiles?
From: Rutger Nijlunsing @ 2006-04-05 23:27 UTC
To: Christopher Faylor; +Cc: Johannes Schindelin, Kees-Jan Dijkzeul, git

On Wed, Apr 05, 2006 at 05:08:44PM -0400, Christopher Faylor wrote:
> On Wed, Apr 05, 2006 at 04:14:20PM +0200, Johannes Schindelin wrote:
> >> Inspired by a patch of Alex Riesen (thanks, Alex), I tried to use the
> >> regular mmap for mapping pack files, only to discover that I compile
> >> without defining "NO_MMAP", so I've been using the stock mmap all
> >> along. So now I'm thinking that the cygwin mmap also does a
> >> malloc-and-read, just like git does with NO_MMAP. So I'll continue to
> >> investigate in that direction.
> >
> >I think cygwin's mmap() is based on the Win32 API equivalent, which could
> >mean that it *is* memory mapped, but in a special area (which is smaller
> >than 1.5 gigabyte). In this case, it would make sense to limit the pack
> >size, thereby having several packs, and mmap() them as they are needed.
>
> Yes, cygwin's mmap uses CreateFileMapping and MapViewOfFile. IIRC,
> Windows might have a 2G limitation lurking under the hood somewhere but
> I think that might be tweakable with some registry setting.

Windows places its DLLs criss-cross through the memory space because
every DLL on the system has its own preferred place to be loaded (the
base address). This severely limits the size of the largest contiguous
memory block available, which is what a single mmap() needs, I think.

Several solutions exist:
  - enlarge the address space with the /3GB boot flag in boot.ini
  - rebase all DLLs with REBASE.EXE (part of the Platform SDK).
    Just make them the same and fix them to a low address.
    The problem is rebasing system DLLs, since those are locked by
    the system.
  - at the start of the program, before other DLLs are loaded,
    reserve as large a part of the memory as possible with
    VirtualAlloc()

-- 
Rutger Nijlunsing ---------------------------------- eludias ed dse.nl
never attribute to a conspiracy which can be explained by incompetence
----------------------------------------------------------------------

* Re: Cygwin can't handle huge packfiles?
From: Christopher Faylor @ 2006-04-06 0:34 UTC
To: git

On Thu, Apr 06, 2006 at 01:27:39AM +0200, Rutger Nijlunsing wrote:
>On Wed, Apr 05, 2006 at 05:08:44PM -0400, Christopher Faylor wrote:
>> On Wed, Apr 05, 2006 at 04:14:20PM +0200, Johannes Schindelin wrote:
>> >> Inspired by a patch of Alex Riesen (thanks, Alex), I tried to use the
>> >> regular mmap for mapping pack files, only to discover that I compile
>> >> without defining "NO_MMAP", so I've been using the stock mmap all
>> >> along. So now I'm thinking that the cygwin mmap also does a
>> >> malloc-and-read, just like git does with NO_MMAP. So I'll continue to
>> >> investigate in that direction.
>> >
>> >I think cygwin's mmap() is based on the Win32 API equivalent, which could
>> >mean that it *is* memory mapped, but in a special area (which is smaller
>> >than 1.5 gigabyte). In this case, it would make sense to limit the pack
>> >size, thereby having several packs, and mmap() them as they are needed.
>>
>> Yes, cygwin's mmap uses CreateFileMapping and MapViewOfFile. IIRC,
>> Windows might have a 2G limitation lurking under the hood somewhere but
>> I think that might be tweakable with some registry setting.
>
>Windows places its DLLs criss-cross through the memory space because
>every DLL on the system has its own preferred place to be loaded (the
>base address). This severely limits the amount of largest contiguous
>memory block available, which is needed for one mmap() I think.
>
>Several solutions exist:
>  - enlarge the address space with the /3GB boot flag in boot.ini

Thanks. The 3GB boot flag is what I was trying to remember.

>  - rebase all DLLs with REBASE.EXE (part of platform sdk) .
>    Just make them the same and fix them to a low address.
>    Problem is rebasing system dlls since those are locked by the system.

Cygwin has its own version of rebase and a method for rebasing all of
the dlls in the distribution. Using that may help squeeze out a little
bit of memory.

>  - at start of program before other DLLs are loaded,
>    reserve an as large part of the memory as possible with
>    VirtualAlloc()

Cygwin actually uses this trick to try to push DLLs into their right
locations after a fork. It sort of works but sometimes, in a child
process, Windows puts "stuff" in locations previously occupied by a
DLL. I could swear that it does that just to be annoying...

There is a chicken/egg problem here in that Cygwin uses Doug Lea's
malloc and that version of malloc will use mmap when sbrk() fails -- as
it is apt to do when allocating gigabytes of memory. So, using malloc
is not a way to avoid mmap.

cgf

* Re: Cygwin can't handle huge packfiles?
From: Junio C Hamano @ 2006-04-06 4:13 UTC
To: Kees-Jan Dijkzeul; +Cc: git

"Kees-Jan Dijkzeul" <k.j.dijkzeul@gmail.com> writes:

> I'm trying to get Git to manage my companies source tree. We're
> writing software for digital TV sets. Anyway, the archive is about 5Gb
> in size and contains binaries, zip files, excel sheets meeting minutes
> and whatnot. So it doesn't compress very well. The 1.5Gb pack file
> hardly contains any history at all (five commits or so). On the flip
> side, for now I'll be the only one adding to the archive, so at least
> it will not grow that fast ;-)
>
> Anyway, to reconstitute the tree, I need very nearly the entire pack,
> so limiting the pack size won't do much good, as git will still try to
> allocate a total of 1.5Gb memory (which, unfortunately, isn't there
> :-)

Right now we LRU the pack files and evict older ones when we
mmap too many, but the unit of eviction is the whole file, so it
would not help the case like yours at all.

It might be possible to mmap only part of a packfile, but it
would involve fairly major surgery to sha1_file.c.
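
Mapping only part of a file, as suggested above, could look roughly like
the sketch below. It is an illustration of the general POSIX technique
only, not a proposal for how sha1_file.c should do it: struct file_window
and the two functions are hypothetical names, and the caller is assumed
to keep offset + len within the file. The one non-obvious step is that
mmap() wants a page-aligned file offset, so the window has to start at
the page boundary below the requested byte.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical bookkeeping for one mapped window of a larger file. */
struct file_window {
	void *base;                /* page-aligned address returned by mmap() */
	size_t map_len;            /* length given to mmap(), needed by munmap() */
	const unsigned char *data; /* first requested byte inside the mapping */
	size_t len;                /* number of requested bytes */
};

/* Map bytes [offset, offset + len) of fd read-only; returns 0 or -1. */
int map_file_window(int fd, off_t offset, size_t len, struct file_window *w)
{
	long page = sysconf(_SC_PAGESIZE);
	off_t aligned;
	size_t slack;
	void *base;

	assert(w != NULL && offset >= 0 && len > 0);
	if (page <= 0)
		return -1;

	/* Round the start down to a page boundary and remember the gap. */
	aligned = offset - (offset % page);
	slack = (size_t)(offset - aligned);

	base = mmap(NULL, len + slack, PROT_READ, MAP_PRIVATE, fd, aligned);
	if (base == MAP_FAILED)
		return -1;

	w->base = base;
	w->map_len = len + slack;
	w->data = (const unsigned char *)base + slack;
	w->len = len;
	return 0;
}

/* Release a window obtained from map_file_window(). */
int unmap_file_window(struct file_window *w)
{
	int ret = munmap(w->base, w->map_len);

	w->base = NULL;
	w->data = NULL;
	w->map_len = w->len = 0;
	return ret;
}
```

With windows like this, the unit of eviction could be a window rather
than a whole pack, so only a bounded amount of address space is in use
at any time.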

* Re: Cygwin can't handle huge packfiles?
From: Junio C Hamano @ 2006-04-07 8:15 UTC
To: git; +Cc: Kees-Jan Dijkzeul, Linus Torvalds

Linus Torvalds <torvalds@osdl.org> writes:

> On Mon, 3 Apr 2006, Linus Torvalds wrote:
>>
>> That said, I think git _does_ have problems with large pack-files. We have
>> some 32-bit issues etc
>
> I should clarify that. git _itself_ shouldn't have any 32-bit issues, but
> the packfile data structure does. The index has 32-bit offsets into
> individual pack-files.
>
> That's not hugely fundamental,...

Linus _does_ understand what he means, but let me clarify and
outline a possible future direction.

 * pack-*.pack file has the following format:

   - The header appears at the beginning and consists of the
     following:

     4-byte signature
     4-byte version number (network byte order)
     4-byte number of objects contained in the pack (network byte order)

     Observation: we cannot have more than 4G versions ;-) and
     more than 4G objects in a pack.

   - The header is followed by a number of object entries, each of
     which looks like this:

     (undeltified representation)
     n-byte type and length (3-bit type, (n-1)*7+4-bit length)
     compressed data

     (deltified representation)
     n-byte type and length (3-bit type, (n-1)*7+4-bit length)
     20-byte base object name
     compressed delta data

     Observation: length of each object is encoded in a variable
     length format and is not constrained to 32-bit or anything.

   - The trailer records 20-byte SHA1 checksum of all of the above.

 * pack-*.idx file has the following format:

   - The header consists of 256 4-byte network byte order
     integers. N-th entry of this table records the number of
     objects in the corresponding pack, the first byte of whose
     object name is less than or equal to N.

     Observation: we would need to extend this to an array of
     8-byte integers to go beyond 4G objects per pack, but it is
     not strictly necessary.

   - The header is followed by sorted 24-byte entries, one entry
     per object in the pack. Each entry is:

     4-byte network byte order integer, recording where the
     object is stored in the packfile as the offset from the
     beginning.

     20-byte object name.

     Observation: we would definitely need to extend this to
     8-byte integer plus 20-byte object name to handle a packfile
     that is larger than 4GB.

   - The file is concluded with a trailer:

     A copy of the 20-byte SHA1 checksum at the end of
     corresponding packfile.

     20-byte SHA1-checksum of all of the above.

This is not fundamental, in that pack idx file is something we
can regenerate from a packfile. The push/fetch transfer over
git native protocols does not even transfer pack idx file;
instead, the recipient uses git-index-pack to generate pack idx.

git-index-pack would need to be updated to update the necessary
fields to 8-byte integers, without breaking existing packfiles.

The code to read idx file currently has a sanity check logic to
make sure that the size of the idx file is consistent with
24-byte entries (the last entry in the header matches the number
of objects recorded in the pack). So we could reliably tell
between the current 24-byte version and 28-byte "beyond 4GB"
version, and support both formats at the same time.

Even after we start supporting the 28-byte "beyond 4GB" format,
we can and we should continue writing the current 24-byte
version of pack idx file when the packfile offset can be
expressed with 32-bit.

Having said that, I have to warn that this is not for the weak of
heart. The necessary changes would be somewhat involved.

----------------------------------------------------------------
Pack idx file

        idx
            +--------------------------------+
            | fanout[0] = 2                  |-.
            +--------------------------------+ |
            | fanout[1]                      | |
            +--------------------------------+ |
            | fanout[2]                      | |
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
            | fanout[255]                    | |
            +--------------------------------+ |
main        | offset                         | |
index       | object name 00XXXXXXXXXXXXXXXX | |
table       +--------------------------------+ |
            | offset                         | |
            | object name 00XXXXXXXXXXXXXXXX | |
            +--------------------------------+ |
          .-| offset                         |<+
          | | object name 01XXXXXXXXXXXXXXXX |
          | +--------------------------------+
          | | offset                         |
          | | object name 01XXXXXXXXXXXXXXXX |
          | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          | | offset                         |
          | | object name FFXXXXXXXXXXXXXXXX |
          | +--------------------------------+
trailer   | | packfile checksum              |
          | +--------------------------------+
          | | idxfile checksum               |
          | +--------------------------------+
          .-------.
                  |
Pack file entry: <+

     packed object header:
        1-byte type (bit 4-6)
               size0 (bit 0-3)
               end-of-length (bit 7)
        n-byte sizeN (as long as MSB is set, each 7-bit)
                size0..sizeN form 4+7+7+..+7 bit integer, size0
                is the least significant part.
     packed object data:
        If it is not DELTA, then deflated bytes (the size above
                is the size before compression).
        If it is DELTA, then
          20-byte base object name SHA1 (the size above is the
                size of the delta data that follows).
          delta data, deflated.
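
As a rough illustration of the two structures described above, the
sketch below decodes a packed object header and turns a fan-out table
into a search range. The function names are invented for this example
and the code is not taken from git. It makes two assumptions explicit:
the four size bits of the first byte are the least significant bits of
the size, with each following byte contributing seven more significant
bits, and fanout[N] counts the objects whose first name byte is less
than or equal to N (as in the diagram, where fanout[0] = 2 covers the
two 00... entries). The fan-out values are expected in host byte order
already.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode the variable-length type/size header of one pack entry.
 * Returns the number of header bytes consumed, or 0 if the buffer is
 * truncated or the size would not fit in 64 bits.
 */
size_t decode_object_header(const unsigned char *buf, size_t len,
			    unsigned *type, uint64_t *size)
{
	size_t used = 0;
	unsigned shift = 4;
	unsigned char c;

	assert(buf != NULL && type != NULL && size != NULL);
	if (len == 0)
		return 0;

	c = buf[used++];
	*type = (c >> 4) & 7;	/* bits 4-6 */
	*size = c & 15;		/* bits 0-3: lowest four bits of the size */

	while (c & 0x80) {	/* bit 7 set: another size byte follows */
		if (used == len || shift > 57)
			return 0;
		c = buf[used++];
		*size |= (uint64_t)(c & 0x7f) << shift;
		shift += 7;
	}
	return used;
}

/*
 * Given the 256-entry fan-out table, compute the half-open range
 * [*lo, *hi) of index entries whose object name starts with first_byte.
 */
void fanout_range(const uint32_t fanout[256], unsigned char first_byte,
		  uint32_t *lo, uint32_t *hi)
{
	*lo = first_byte ? fanout[first_byte - 1] : 0;
	*hi = fanout[first_byte];
}
```

A lookup would then binary-search the object names only within
[lo, hi) of the sorted entry table. Because the size is decoded into a
64-bit value, the object header itself imposes no 32-bit limit; the
limit sits in the 4-byte offset field of the index entries.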

* Re: Cygwin can't handle huge packfiles?
From: Jakub Narebski @ 2006-04-07 8:27 UTC
To: git

Junio C Hamano wrote:

>  * pack-*.pack file has the following format:
[...]
>  * pack-*.idx file has the following format:
[...]

Could you please put the information in parent post somewhere in
Documentation, for example Documentation/technical/pack-format.txt
(perhaps together with putting description of packing heuristic from
http://marc.theaimsgroup.com/?l=git&m=114134881923320 by Jon Loeliger in
Documentation/technical/pack-heuristics.txt even if it doesn't conform
to "serious documentation" standards)?

Thanks in advance

-- 
Jakub Narebski
Warsaw, Poland

* Re: Cygwin can't handle huge packfiles?
From: Nicolas Pitre @ 2006-04-07 14:11 UTC
To: Junio C Hamano; +Cc: git, Kees-Jan Dijkzeul, Linus Torvalds

On Fri, 7 Apr 2006, Junio C Hamano wrote:

> Linus Torvalds <torvalds@osdl.org> writes:
>
> > On Mon, 3 Apr 2006, Linus Torvalds wrote:
> >>
> >> That said, I think git _does_ have problems with large pack-files. We have
> >> some 32-bit issues etc
> >
> > I should clarify that. git _itself_ shouldn't have any 32-bit issues, but
> > the packfile data structure does. The index has 32-bit offsets into
> > individual pack-files.
> >
> > That's not hugely fundamental,...
>
> Linus _does_ understand what he means, but let me clarify and
> outline a possible future direction.
>
[...]

For the record, the delta code also has 32-bit limitations of its own
presently. It cannot encode a delta against a buffer which is larger
than 4GB.

I however made sure the byte 0 could be used as a prefix for future
encoding extensions, like 64-bit file offsets for example.

Nicolas

* Re: Cygwin can't handle huge packfiles?
From: Junio C Hamano @ 2006-04-07 18:31 UTC
To: Nicolas Pitre; +Cc: git

Nicolas Pitre <nico@cam.org> writes:

> On Fri, 7 Apr 2006, Junio C Hamano wrote:
>
>> Linus Torvalds <torvalds@osdl.org> writes:
>>
>> > On Mon, 3 Apr 2006, Linus Torvalds wrote:
>> >>
>> >> That said, I think git _does_ have problems with large pack-files. We have
>> >> some 32-bit issues etc
>> >
>> > I should clarify that. git _itself_ shouldn't have any 32-bit issues, but
>> > the packfile data structure does. The index has 32-bit offsets into
>> > individual pack-files.
>> >
>> > That's not hugely fundamental,...
>>
>> Linus _does_ understand what he means, but let me clarify and
>> outline a possible future direction.
>
> For the record, the delta code also has 32-bit limitations of its own
> presently. It cannot encode a delta against a buffer which is larger
> than 4GB.
>
> I however made sure the byte 0 could be used as a prefix for future
> encoding extensions, like 64-bit file offsets for example.

True, the delta data representation, not just the "delta code",
has that limitation, but I do not think you issue "insert 0-byte
literal data" command from the deltifier side right now, so we
should be OK.

Maybe we would want to check (cmd == 0) case to detect delta
extension that we do not handle right now?
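
The (cmd == 0) check suggested above might be sketched like this. The
enum and function names are invented for the illustration. The thread
itself only establishes that a non-zero command can mean "insert that
many literal bytes" and that byte 0 is kept free for extensions; that a
command with the high bit set is a copy from the base object is an
assumption about the delta format that the thread does not spell out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of the first byte of a delta command. */
enum delta_cmd_kind {
	DELTA_CMD_COPY,		/* high bit set: copy a range from the base */
	DELTA_CMD_INSERT,	/* 1..127: that many literal bytes follow */
	DELTA_CMD_RESERVED	/* 0: kept free for future extensions */
};

enum delta_cmd_kind classify_delta_cmd(unsigned char cmd)
{
	if (cmd & 0x80)
		return DELTA_CMD_COPY;
	if (cmd != 0)
		return DELTA_CMD_INSERT;
	return DELTA_CMD_RESERVED;
}

/* Number of literal bytes that follow an insert command. */
size_t delta_insert_length(unsigned char cmd)
{
	assert(classify_delta_cmd(cmd) == DELTA_CMD_INSERT);
	return cmd;
}
```

A delta reader built on this would stop with an error when it meets
DELTA_CMD_RESERVED, rather than treating the byte as an insert of zero
bytes and silently skipping an extension it does not understand.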

* Re: Cygwin can't handle huge packfiles?
From: Nicolas Pitre @ 2006-04-07 18:46 UTC
To: Junio C Hamano; +Cc: git

On Fri, 7 Apr 2006, Junio C Hamano wrote:

> Nicolas Pitre <nico@cam.org> writes:
>
> > On Fri, 7 Apr 2006, Junio C Hamano wrote:
> >
> >> Linus Torvalds <torvalds@osdl.org> writes:
> >>
> >> > On Mon, 3 Apr 2006, Linus Torvalds wrote:
> >> >>
> >> >> That said, I think git _does_ have problems with large pack-files. We have
> >> >> some 32-bit issues etc
> >> >
> >> > I should clarify that. git _itself_ shouldn't have any 32-bit issues, but
> >> > the packfile data structure does. The index has 32-bit offsets into
> >> > individual pack-files.
> >> >
> >> > That's not hugely fundamental,...
> >>
> >> Linus _does_ understand what he means, but let me clarify and
> >> outline a possible future direction.
> >
> > For the record, the delta code also has 32-bit limitations of its own
> > presently. It cannot encode a delta against a buffer which is larger
> > than 4GB.
> >
> > I however made sure the byte 0 could be used as a prefix for future
> > encoding extensions, like 64-bit file offsets for example.
>
> True the delta data representation, not just the "delta code",
> has that limitation, but I do not think you issue "insert 0-byte
> literal data" command from the deltifier side right now, so we
> should be OK.
>
> Maybe we would want to check (cmd == 0) case to detect delta
> extension that we do not handle right now?

Good idea. Will send you a patch.

Nicolas

* Re: Cygwin can't handle huge packfiles?
From: Johannes Schindelin @ 2006-04-03 15:12 UTC
To: Linus Torvalds; +Cc: Kees-Jan Dijkzeul, git

Hi,

On Mon, 3 Apr 2006, Linus Torvalds wrote:

> On Mon, 3 Apr 2006, Johannes Schindelin wrote:
> >
> > The problem is not mmap() on cygwin, but that a fork() has to jump through
> > hoops to reinstall the open file descriptors on cygwin. If the
> > corresponding file was deleted, that fails. Therefore, we work around that
> > on cygwin by actually reading the file into memory, *not* mmap()ing it.
>
> Well, we could actually do a _real_ mmap on pack-files. The pack-files are
> much better mmap'ed - there we don't _want_ them to be removed while we're
> using them. It was the index file etc that was problematic.
>
> Maybe the cygwin fake mmap should be triggered only for the index (and
> possibly the individual objects - if only because there doing a
> malloc+read may actually be faster).

I hit the problem *only* with "git-whatchanged -p". Which means that the
upcoming we-no-longer-write-temp-files-for-diff version should make that
gitfakemmap() hack obsolete. (I have not checked whether there are other
places where a file is mmap()ed and then used by a fork()ed process.)

Ciao,
Dscho

* Re: Cygwin can't handle huge packfiles?
From: Alex Riesen @ 2006-04-03 14:38 UTC
To: Kees-Jan Dijkzeul; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 1189 bytes --]

On 4/3/06, Kees-Jan Dijkzeul <k.j.dijkzeul@gmail.com> wrote:
> I'm trying to get Git to manage a 5 GB source tree. Under linux, this
> works like a charm. Under cygwin, however, I run into difficulties.
> For example:
>
> $ git-clone sgp-wa/ sgp-wa.clone
> fatal: packfile
> ./objects/pack/pack-56aa013a0234e198467ed37ae5db925764a6ee98.pack
> cannot be mapped.
> fatal: unexpected EOF
> fetch-pack from '/cygdrive/e/Projects/sgp-wa/.git' failed.
>
> To figure out what is happening, I printed the value of errno, which
> turns out to be 12 (Cannot allocate memory). I'm not sure how mmap is

mmap in git on cygwin does not mmap anything, but just reads the whole
file into memory.

> I'm not sure how to approach this problem. Any tips would be greatly
> appreciated.

I ended up hacking gitfakemmap like in the attached patches (sorry for
mime). It's a very ugly and unsafe hack, and that is exactly the
reason why it was never submitted. Still, it helps me (it speeds up
rev-list, for instance), and maybe it'll help you.
It is a really good example of what stupid windows restrictions can do to
a program.

The patch is against git as of 3-Apr-2006, ~10 CET

[-- Attachment #2: cygmmap.patch --]
[-- Type: text/x-patch, Size: 5710 bytes --]

diff --git a/Makefile b/Makefile
index c79d646..8a46436
--- a/Makefile
+++ b/Makefile
@@ -389,7 +389,7 @@ ifdef NO_SETENV
 endif
 ifdef NO_MMAP
 	COMPAT_CFLAGS += -DNO_MMAP
-	COMPAT_OBJS += compat/mmap.o
+	COMPAT_OBJS += compat/mmap.o compat/realmmap.o
 endif
 ifdef NO_IPV6
 	ALL_CFLAGS += -DNO_IPV6
diff --git a/compat/realmmap.c b/compat/realmmap.c
new file mode 100644
index 0000000..8f26641
--- /dev/null
+++ b/compat/realmmap.c
@@ -0,0 +1,26 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <sys/mman.h>
+#include "../git-compat-util.h"
+
+#undef mmap
+#undef munmap
+
+void *realmmap(void *start, size_t length, int prot , int flags, int fd, off_t offset)
+{
+	if (start != NULL || !(flags & MAP_PRIVATE)) {
+		errno = ENOTSUP;
+		return MAP_FAILED;
+	}
+	start = mmap(start, length, prot, flags, fd, offset);
+	return start;
+}
+
+int realmunmap(void *start, size_t length)
+{
+	return munmap(start, length);
+}
+
+
diff --git a/diff.c b/diff.c
index e496905..f1a2cf0 100644
--- a/diff.c
+++ b/diff.c
@@ -450,7 +450,7 @@ int diff_populate_filespec(struct diff_f
 		fd = open(s->path, O_RDONLY);
 		if (fd < 0)
 			goto err_empty;
-		s->data = mmap(NULL, s->size, PROT_READ, MAP_PRIVATE, fd, 0);
+		s->data = realmmap(NULL, s->size, PROT_READ, MAP_PRIVATE, fd, 0);
 		close(fd);
 		if (s->data == MAP_FAILED)
 			goto err_empty;
@@ -482,7 +482,7 @@ void diff_free_filespec_data(struct diff
 	if (s->should_free)
 		free(s->data);
 	else if (s->should_munmap)
-		munmap(s->data, s->size);
+		realmunmap(s->data, s->size);
 	s->should_free = s->should_munmap = 0;
 	s->data = NULL;
 	free(s->cnt_data);
diff --git a/git-compat-util.h b/git-compat-util.h
index 5d543d2..85150f8 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -42,22 +42,28 @@ extern int error(const char *err, ...) _
 
 #ifdef NO_MMAP
 
-#ifndef PROT_READ
+#include <sys/mman.h>
+/*#ifndef PROT_READ
 #define PROT_READ 1
 #define PROT_WRITE 2
 #define MAP_PRIVATE 1
 #define MAP_FAILED ((void*)-1)
-#endif
+#endif*/
 
 #define mmap gitfakemmap
 #define munmap gitfakemunmap
 extern void *gitfakemmap(void *start, size_t length, int prot , int flags, int fd, off_t offset);
 extern int gitfakemunmap(void *start, size_t length);
 
+extern void *realmmap(void *start, size_t length, int prot , int flags, int fd, off_t offset);
+extern int realmunmap(void *start, size_t length);
+
 #else /* NO_MMAP */
 
 #include <sys/mman.h>
+#define realmmap mmap
+#define realmunmap munmap
 
 #endif /* NO_MMAP */
 
 #ifdef NO_SETENV
diff --git a/sha1_file.c b/sha1_file.c
index 58edec0..712a068 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -330,14 +330,14 @@ void prepare_alt_odb(void)
 		close(fd);
 		return;
 	}
-	map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
+	map = realmmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
 	close(fd);
 	if (map == MAP_FAILED)
 		return;
 
 	link_alt_odb_entries(map, map + st.st_size, '\n',
 			     get_object_directory());
-	munmap(map, st.st_size);
+	realmunmap(map, st.st_size);
 }
 
 static char *find_sha1_file(const unsigned char *sha1, struct stat *st)
@@ -378,7 +378,7 @@ static int check_packed_git_idx(const ch
 		return -1;
 	}
 	idx_size = st.st_size;
-	idx_map = mmap(NULL, idx_size, PROT_READ, MAP_PRIVATE, fd, 0);
+	idx_map = realmmap(NULL, idx_size, PROT_READ, MAP_PRIVATE, fd, 0);
 	close(fd);
 	if (idx_map == MAP_FAILED)
 		return -1;
@@ -423,7 +423,7 @@ static int unuse_one_packed_git(void)
 	}
 	if (!lru)
 		return 0;
-	munmap(lru->pack_base, lru->pack_size);
+	realmunmap(lru->pack_base, lru->pack_size);
 	lru->pack_base = NULL;
 	return 1;
 }
@@ -460,7 +460,7 @@ int use_packed_git(struct packed_git *p)
 		}
 		if (st.st_size != p->pack_size)
 			die("packfile %s size mismatch.", p->pack_name);
-		map = mmap(NULL, p->pack_size, PROT_READ, MAP_PRIVATE, fd, 0);
+		map = realmmap(NULL, p->pack_size, PROT_READ, MAP_PRIVATE, fd, 0);
close(fd); if (map == MAP_FAILED) die("packfile %s cannot be mapped.", p->pack_name); @@ -494,7 +494,7 @@ struct packed_git *add_packed_git(char * /* do we have a corresponding .pack file? */ strcpy(path + path_len - 4, ".pack"); if (stat(path, &st) || !S_ISREG(st.st_mode)) { - munmap(idx_map, idx_size); + realmunmap(idx_map, idx_size); return NULL; } /* ok, it looks sane as far as we can check without @@ -647,7 +647,7 @@ static void *map_sha1_file_internal(cons */ sha1_file_open_flag = 0; } - map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0); + map = realmmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0); close(fd); if (map == MAP_FAILED) return NULL; @@ -1184,7 +1184,7 @@ int sha1_object_info(const unsigned char *sizep = size; } inflateEnd(&stream); - munmap(map, mapsize); + realmunmap(map, mapsize); return status; } @@ -1210,7 +1210,7 @@ void * read_sha1_file(const unsigned cha map = map_sha1_file_internal(sha1, &mapsize); if (map) { buf = unpack_sha1_file(map, mapsize, type, size); - munmap(map, mapsize); + realmunmap(map, mapsize); return buf; } return NULL; @@ -1493,7 +1493,7 @@ int write_sha1_to_fd(int fd, const unsig } while (posn < objsize); if (map) - munmap(map, objsize); + realmunmap(map, objsize); if (temp_obj) free(temp_obj); @@ -1646,7 +1646,7 @@ int index_fd(unsigned char *sha1, int fd buf = ""; if (size) - buf = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0); + buf = realmmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0); close(fd); if (buf == MAP_FAILED) return -1; @@ -1660,7 +1660,7 @@ int index_fd(unsigned char *sha1, int fd ret = 0; } if (size) - munmap(buf, size); + realmunmap(buf, size); return ret; } ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: Cygwin can't handle huge packfiles?
@ 2006-04-06 20:57 linux
2006-04-06 23:53 ` Junio C Hamano
0 siblings, 1 reply; 21+ messages in thread
From: linux @ 2006-04-06 20:57 UTC (permalink / raw)
To: git, junkio; +Cc: linux
> Right now we LRU the pack files and evict older ones when we
> mmap too many, but the unit of eviction is the whole file, so it
> would not help the case like yours at all. It might be possible
> to mmap only part of a packfile, but it would involve fairly
> major surgery to sha1_file.c.
The simplest solution seems to be to limit pack file size to a reasonable
fraction of a 32-bit address space. Say, 0.5 G.
That should be a fairly straightforward hack to git-pack-objects.
It already emits two files; just make it emit more.
You can tweak the heuristics to try to find a good break point: start
thinking about splitting the pack when you get to one size, but don't
force a break until you hit a harder limit as long as the deltas are
working well.
This can all be adjustable with a command line and/or config file option
to allow for the eventual demise of 32-bit systems.
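
The soft/hard limit idea above could be sketched in C roughly as
follows. This is only an illustration under stated assumptions:
should_split_pack, struct pack_split_limits and its fields are
hypothetical names, not part of git-pack-objects, and "the deltas are
working well" is reduced here to a single flag saying whether the next
object is stored as a delta against something already in the pack.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tuning knobs; not actual git-pack-objects options. */
struct pack_split_limits {
	uint64_t soft_limit;	/* start looking for a break point here */
	uint64_t hard_limit;	/* force a break before exceeding this */
};

/*
 * Decide whether to close the current pack before writing the next
 * object.
 *
 *   pack_size      bytes already written to the current pack
 *   next_size      bytes the next object would add
 *   next_is_delta  true if the next object is a delta against an
 *                  object in the current pack (assumed stand-in for
 *                  "the deltas are working well")
 */
static bool should_split_pack(const struct pack_split_limits *lim,
			      uint64_t pack_size, uint64_t next_size,
			      bool next_is_delta)
{
	assert(lim->soft_limit <= lim->hard_limit);

	/*
	 * Hard limit: break regardless of deltas. An empty pack is
	 * never split, so an object larger than the hard limit still
	 * ends up in a pack of its own.
	 */
	if (pack_size >= lim->hard_limit ||
	    next_size > lim->hard_limit - pack_size)
		return pack_size > 0;

	/* Below the soft limit: keep filling the current pack. */
	if (pack_size < lim->soft_limit)
		return false;

	/* Between the limits: break at the first non-delta object. */
	return !next_is_delta;
}
```

With a soft limit of 0.5 G as suggested above and a hard limit somewhat
higher (the exact value is an assumption), the writer would call this
before each object and start a new pack/index pair whenever it returns
true.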
^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Cygwin can't handle huge packfiles?
  2006-04-06 20:57 linux
@ 2006-04-06 23:53 ` Junio C Hamano
  2006-04-07  3:05   ` linux
  0 siblings, 1 reply; 21+ messages in thread
From: Junio C Hamano @ 2006-04-06 23:53 UTC (permalink / raw)
To: linux; +Cc: git

linux@horizon.com writes:

>> Right now we LRU the pack files and evict older ones when we
>> mmap too many, but the unit of eviction is the whole file, so it
>> would not help the case like yours at all. It might be possible
>> to mmap only part of a packfile, but it would involve fairly
>> major surgery to sha1_file.c.
>
> The simplest solution seems to be to limit pack file size to a reasonable
> fraction of a 32-bit address space. Say, 0.5 G.

I do not think that would help the original poster's situation
where only 5 revs result in a 1.5G pack. I would _almost_ say
"do not pack such a repository", but there is the initial
cloning over git-aware transports which always results in a
repository with a single pack.

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Cygwin can't handle huge packfiles?
  2006-04-06 23:53 ` Junio C Hamano
@ 2006-04-07  3:05   ` linux
  0 siblings, 0 replies; 21+ messages in thread
From: linux @ 2006-04-07 3:05 UTC (permalink / raw)
To: junkio, linux; +Cc: git

> I do not think that would help the original poster's situation
> where only 5 revs result in a 1.5G pack. I would _almost_ say
> "do not pack such a repository", but there is the initial
> cloning over git-aware transports which always results in a
> repository with a single pack.

Huh? Why not?

That repository has a lot of files. For compression, you want all
versions of a file in one pack, and with few versions that makes it
easier to split up, not harder.

As for network transport of packs, I haven't studied the details, but if
you allow "thin packs" that have deltas relative to objects not in the
pack, then breaking up the pack anywhere should be legal.

Or, if necessary, you can stuff an arbitrarily large file through
git-unpack-objects, which reads a stream from stdin without attempting
to mmap it.

(Speaking of unpack-objects.c, what's that "static unsigned long eof"
variable in there? It never seems to be set to a non-zero value.)

^ permalink raw reply	[flat|nested] 21+ messages in thread
end of thread, other threads:[~2006-04-07 18:47 UTC | newest]

Thread overview: 21+ messages
2006-04-03  9:46 Cygwin can't handle huge packfiles? Kees-Jan Dijkzeul
2006-04-03 13:23 ` Johannes Schindelin
2006-04-03 14:26 ` Morten Welinder
2006-04-03 14:33 ` Linus Torvalds
2006-04-03 14:36 ` Linus Torvalds
2006-04-05 13:24 ` Kees-Jan Dijkzeul
2006-04-05 14:14 ` Johannes Schindelin
2006-04-05 21:08 ` Christopher Faylor
2006-04-05 23:27 ` Rutger Nijlunsing
2006-04-06  0:34 ` Christopher Faylor
2006-04-06  4:13 ` Junio C Hamano
2006-04-07  8:15 ` Junio C Hamano
2006-04-07  8:27 ` Jakub Narebski
2006-04-07 14:11 ` Nicolas Pitre
2006-04-07 18:31 ` Junio C Hamano
2006-04-07 18:46 ` Nicolas Pitre
2006-04-03 15:12 ` Johannes Schindelin
2006-04-03 14:38 ` Alex Riesen
-- strict thread matches above, loose matches on Subject: below --
2006-04-06 20:57 linux
2006-04-06 23:53 ` Junio C Hamano
2006-04-07  3:05 ` linux