* Figured out how to get Mozilla into git
From: Jon Smirl @ 2006-06-09  2:17 UTC
To: git

I was able to import Mozilla into SVN without problem, and it then
occurred to me to import the SVN repository into git. The import has
been running a few hours now and it is up to the year 2000 (the history
starts in 1998). Since I haven't hit any errors yet it will probably
finish OK; I should have the results in the morning. I wonder how long
it will take to start gitk on a 10GB repository.

Once I get this monster into git, are there tools that will let me keep
it in sync with Mozilla CVS? SVN renamed numeric branches to the form
unlabeled-3.7.24, so that may be a problem.

Any advice on how to pack this to make it run faster?

-- 
Jon Smirl
jonsmirl@gmail.com

* Re: Figured out how to get Mozilla into git
From: Nicolas Pitre @ 2006-06-09  2:56 UTC
To: Jon Smirl; +Cc: git

On Thu, 8 Jun 2006, Jon Smirl wrote:

> I was able to import Mozilla into SVN without problem, and it then
> occurred to me to import the SVN repository into git. The import has
> been running a few hours now and it is up to the year 2000 (the
> history starts in 1998). Since I haven't hit any errors yet it will
> probably finish OK; I should have the results in the morning. I
> wonder how long it will take to start gitk on a 10GB repository.

Before you do so, consider repacking the repository with

	git-repack -a -f -d && git-prune-packed

Nicolas

* Re: Figured out how to get Mozilla into git
From: Martin Langhoff @ 2006-06-09  3:06 UTC
To: Jon Smirl; +Cc: git

Jon,

oh, I went back to a cvsimport that I started a couple of days ago.
Completed with no problems... Last commit:

	commit 5ecb56b9c4566618fad602a8da656477e4c6447a
	Author: wtchang%redhat.com <wtchang%redhat.com>
	Date:   Fri Jun 2 17:20:37 2006 +0000

	    Import NSPR 4.6.2 and NSS 3.11.1

	mozilla.git$ du -sh .git/
	2.0G    .git/

It took

	43492.19user 53504.77system 40:23:49elapsed 66%CPU
	(0avgtext+0avgdata 0maxresident)k
	0inputs+0outputs (77334major+3122469478minor)pagefaults 0swaps

> I should have the results in the morning. I wonder how long it will
> take to start gitk on a 10GB repository.

Hopefully not that big :) -- anyway, just do gitk --max-count=1000

> Once I get this monster into git, are there tools that will let me
> keep it in sync with Mozilla CVS?

If you use git-cvsimport, you can safely re-run it on a cronjob to
keep it in sync. Not so sure about the cvs2svn => git-svnimport route,
though git-svnimport does support incremental imports.

> SVN renamed numeric branches to the form unlabeled-3.7.24, so that
> may be a problem.

Ouch.

> Any advice on how to pack this to make it run faster?

git-repack -a -d, but it OOMs on my 2GB+2GBswap machine :(

martin

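The cron-driven sync he describes amounts to re-running the original
import command against the same target directory; git-cvsimport only
fetches what is new since the last run. A minimal sketch - the CVS root
and module name below are illustrative assumptions, not taken from this
thread:

	#!/bin/sh
	# Re-running git-cvsimport against an existing target is
	# incremental: it imports only revisions newer than the last run.
	# CVSROOT and module name are placeholders; -i means import only,
	# without touching any checked-out working tree.
	git-cvsimport -v -i \
	    -d :pserver:anonymous@cvs-mirror.mozilla.org:/cvsroot \
	    -C /srv/mozilla.git mozilla
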
* Re: Figured out how to get Mozilla into git
From: Jon Smirl @ 2006-06-09  3:28 UTC
To: Martin Langhoff; +Cc: git

On 6/8/06, Martin Langhoff <martin.langhoff@gmail.com> wrote:
> Jon,
>
> oh, I went back to a cvsimport that I started a couple of days ago.
> Completed with no problems...

I am using cvsps-2.1-3.fc5; the last time I tried, it died in the
middle of the import, and I don't remember why. Which cvsps are you
using? You're saying that it can handle the whole Mozilla CVS now,
right? I will build a new cvsps from CVS and start it running tonight.

> If you use git-cvsimport, you can safely re-run it on a cronjob to
> keep it in sync. Not so sure about the cvs2svn => git-svnimport
> route, though git-svnimport does support incremental imports.

I would much rather get a direct CVS import working so that I can do
incremental updates. I went the SVN route because it was the only
thing I could get working.

> > Any advice on how to pack this to make it run faster?
>
> git-repack -a -d, but it OOMs on my 2GB+2GBswap machine :(

We are all having problems getting this to run on 32-bit machines with
the 3-4GB process size limitations.

-- 
Jon Smirl
jonsmirl@gmail.com

* Re: Figured out how to get Mozilla into git
From: Jakub Narebski @ 2006-06-09  7:17 UTC
To: git

Jon Smirl wrote:

>> git-repack -a -d, but it OOMs on my 2GB+2GBswap machine :(
>
> We are all having problems getting this to run on 32-bit machines
> with the 3-4GB process size limitations.

Is that expected (for a 10GB repository, if I remember correctly), or
is there some way to avoid this OOM?

-- 
Jakub Narebski
Warsaw, Poland

* Re: Figured out how to get Mozilla into git
From: Linus Torvalds @ 2006-06-09 15:01 UTC
To: Jakub Narebski; +Cc: git

On Fri, 9 Jun 2006, Jakub Narebski wrote:
> Jon Smirl wrote:
>
> >> git-repack -a -d, but it OOMs on my 2GB+2GBswap machine :(
> >
> > We are all having problems getting this to run on 32-bit machines
> > with the 3-4GB process size limitations.
>
> Is that expected (for a 10GB repository, if I remember correctly), or
> is there some way to avoid this OOM?

Well, to some degree, the VM limitations are inevitable with huge
packs.

The original idea for packs was to avoid making one huge pack, partly
because it was expected to be really, really slow to generate (so
incremental repacking was a much better strategy), but partly simply
because trying to map one huge pack is really hard to do.

For various reasons, we ended up mostly using a single pack most of the
time: it's the most efficient model when the project is reasonably
sized, and it turns out that with delta re-use, repacking even
moderately large projects like the kernel doesn't actually take all
that long.

But the fact that we ended up mostly using a single pack for the
kernel, for example, doesn't mean that the fundamental reasons git
supports multiple packs have somehow gone away. At some point, the
project gets large enough that one single pack simply isn't reasonable.

So a single 2GB pack is already very much pushing it. It's really,
really hard to map a 2GB file on a 32-bit platform: your VM is usually
fragmented enough that it simply isn't practical. In fact, I think the
limit for _practical_ usage of single packs is probably somewhere in
the half-gig region, unless you just have 64-bit machines.

And yes, I realize that the "single pack" thing has actually become a
fact of life for cloning, for example. Originally, cloning would unpack
on the receiving end and leave the repacking to happen there, but that
obviously sucked, so now when we clone, we always get a single pack.
That can absolutely be a problem.

I don't know what the right solution is. Single packs _are_ very
useful, especially after a clone. So it's possible that we should just
make the pack-reading code able to map partial packs.

But the point is that there are certainly ways we can fix this - it's
not _really_ fundamental. It's going to complicate things a bit (damn,
how I hate 32-bit VM limitations), but the good news is that the whole
git model of "everything is an individual object" means that it's a
very _local_ decision: it will probably be painful to re-do some of the
pack-reading code and have an LRU of pack _fragments_ instead of an LRU
of packs, but it's only going to affect a small part of git, and
everything else will never even see it.

So large packs are not really a fundamental problem, but right now we
have some practical issues with them.

(It's not _just_ packs: running out of memory is also because
git-rev-list --objects is pretty memory hungry. I've improved the
memory usage several times by over 50%, but people keep trying larger
projects. It used to be that I considered the kernel a large history;
now we're talking about things that have ten times the number of
objects.)

Martin - do you have some place to make that big mozilla repo
available? It would be a good test-case..

Linus

* Re: Figured out how to get Mozilla into git
From: Nicolas Pitre @ 2006-06-09 16:11 UTC
To: Linus Torvalds; +Cc: Jakub Narebski, git

On Fri, 9 Jun 2006, Linus Torvalds wrote:

> On Fri, 9 Jun 2006, Jakub Narebski wrote:
> > Jon Smirl wrote:
> >
> > >> git-repack -a -d, but it OOMs on my 2GB+2GBswap machine :(
> > >
> > > We are all having problems getting this to run on 32-bit machines
> > > with the 3-4GB process size limitations.
> >
> > Is that expected (for a 10GB repository, if I remember correctly),
> > or is there some way to avoid this OOM?

What was that 10GB related to, exactly? The original CVS repo, or the
unpacked GIT repo?

> So a single 2GB pack is already very much pushing it. It's really,
> really hard to map a 2GB file on a 32-bit platform: your VM is
> usually fragmented enough that it simply isn't practical. In fact, I
> think the limit for _practical_ usage of single packs is probably
> somewhere in the half-gig region, unless you just have 64-bit
> machines.

Sure, but have we already reached that size? The historic Linux repo
currently repacks itself into a ~175MB pack for 63428 commits. The
current Linux repo is ~103MB with a much shorter history (27153
commits). Given the above, we can estimate the size of the kernel
repository after x commits as follows:

	slope = (175 - 103) / (63428 - 27153) = approx 2KB per commit
	initial size = 175 - .001985 * 63428 = 49MB

So the initial kernel commit is about 49MB in size, which is consistent
with the corresponding compressed tarball, and subsequent commits are
about 2KB each on average. At that rate it will take about 233250
commits before the kernel reaches a half-gigabyte pack file, and given
the current commit rate (approx 23700 commits per year) that means we
still have nearly 9 years to go. By that point, 64-bit machines are
likely to be the norm.

So given those numbers I don't think this is really an issue, and the
Linux kernel is a rather huge and pretty active project to base
comparisons against. The Mozilla repository might be difficult to
import and repack, but once repacked it should still be pretty usable,
even now on a 32-bit machine, even with a single pack.

Otherwise it should be quite easy to add a batch size argument to
git-repack so that git-rev-list and git-pack-objects are called
multiple times with sequential commit ranges, creating a repo with
multiple packs.

Nicolas

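A hand-rolled sketch of that batching idea, splitting history at an
arbitrary boundary commit. Nothing like this exists in git-repack as
shipped, and the boundary name is a placeholder:

	#!/bin/sh
	# Pack history as two sequential ranges instead of one huge pack.
	# $OLD is a hypothetical boundary commit chosen by hand.
	OLD=some-boundary-commit
	cd .git/objects/pack
	# Everything reachable from the boundary goes into one pack...
	git-rev-list --objects "$OLD" | git-pack-objects pack
	# ...and everything newer goes into a second one.
	git-rev-list --objects --all --not "$OLD" | git-pack-objects pack
	# (The now-redundant loose objects would still need
	# git-prune-packed afterwards.)
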
* Re: Figured out how to get Mozilla into git
From: Linus Torvalds @ 2006-06-09 16:30 UTC
To: Nicolas Pitre; +Cc: Jakub Narebski, git

On Fri, 9 Jun 2006, Nicolas Pitre wrote:
>
> > So a single 2GB pack is already very much pushing it. It's really,
> > really hard to map a 2GB file on a 32-bit platform: your VM is
> > usually fragmented enough that it simply isn't practical. In fact,
> > I think the limit for _practical_ usage of single packs is probably
> > somewhere in the half-gig region, unless you just have 64-bit
> > machines.
>
> Sure, but have we already reached that size?

Not for the Linux repos.

But apparently the mozilla repo ends up being 2GB in git. From Martin:

>> oh, I went back to a cvsimport that I started a couple of days ago.
>> Completed with no problems...
>>
>> Last commit:
>> commit 5ecb56b9c4566618fad602a8da656477e4c6447a
>> Author: wtchang%redhat.com <wtchang%redhat.com>
>> Date:   Fri Jun 2 17:20:37 2006 +0000
>>
>>     Import NSPR 4.6.2 and NSS 3.11.1
>>
>> mozilla.git$ du -sh .git/
>> 2.0G    .git/

Now, that was done with _incremental_ repacking (i.e. his .git
directory won't be just one large pack), but I bet that if you were to
clone it (without using the "-l" flag or rsync/http), you'd end up with
serious trouble because of the single-pack limit.

So we're starting to see archives where single packs are problematic
for a 32-bit architecture.

Linus

* Re: Figured out how to get Mozilla into git
From: Nicolas Pitre @ 2006-06-09 17:38 UTC
To: Linus Torvalds; +Cc: Jakub Narebski, git

On Fri, 9 Jun 2006, Linus Torvalds wrote:

> On Fri, 9 Jun 2006, Nicolas Pitre wrote:
> >
> > > So a single 2GB pack is already very much pushing it. It's
> > > really, really hard to map a 2GB file on a 32-bit platform: your
> > > VM is usually fragmented enough that it simply isn't practical.
> > > In fact, I think the limit for _practical_ usage of single packs
> > > is probably somewhere in the half-gig region, unless you just
> > > have 64-bit machines.
> >
> > Sure, but have we already reached that size?
>
> Not for the Linux repos.
>
> But apparently the mozilla repo ends up being 2GB in git. From
> Martin:
>
> >> oh, I went back to a cvsimport that I started a couple of days
> >> ago. Completed with no problems...
> >>
> >> Last commit:
> >> commit 5ecb56b9c4566618fad602a8da656477e4c6447a
> >> Author: wtchang%redhat.com <wtchang%redhat.com>
> >> Date:   Fri Jun 2 17:20:37 2006 +0000
> >>
> >>     Import NSPR 4.6.2 and NSS 3.11.1
> >>
> >> mozilla.git$ du -sh .git/
> >> 2.0G    .git/

He also said:

| git-repack -a -d, but it OOMs on my 2GB+2GBswap machine :(

> Now, that was done with _incremental_ repacking (i.e. his .git
> directory won't be just one large pack),

So given the nature of packs, incrementally packing an imported
repository _might_ cause worse problems, since each pack must be
self-contained by definition. That means you may end up with multiple
revisions of the same file distributed amongst as many packs, so that
none of those revisions are ever deltified against each other; and to
repack that you currently have to mmap all those packs at once.

> but I bet that if you were to clone it (without using the "-l" flag
> or rsync/http), you'd end up with serious trouble because of the
> single-pack limit.

Maybe that single pack would instead be under the 512MB limit? I'd be
curious to know.

> So we're starting to see archives where single packs are problematic
> for a 32-bit architecture.

Depending on the operation, the single pack might actually be better,
especially for a full clone where everything gets mapped. Multiple
packs will always take more space, which is fine if you don't need
access to all objects at once, since the individual packs are small -
but the whole of them (when repacking or cloning) isn't.

Nicolas

* Re: Figured out how to get Mozilla into git
From: Linus Torvalds @ 2006-06-09 17:49 UTC
To: Nicolas Pitre; +Cc: Jakub Narebski, git

On Fri, 9 Jun 2006, Nicolas Pitre wrote:
>
> Maybe that single pack would instead be under the 512MB limit? I'd be
> curious to know.

Possible, but not likely, and with "git repack -a -d" running out of
memory, we clearly already have a problem in checking that. That is
most likely git-rev-list, though. Which is why I'd like to just rsync
the repo, run git-rev-list on it, and see what else I can shave off ;)

> > So we're starting to see archives where single packs are
> > problematic for a 32-bit architecture.
>
> Depending on the operation, the single pack might actually be better,

Absolutely. Which is why I said we probably need to do an LRU on pack
fragments rather than full packs when we do the pack memory mapping.

Linus

* Re: Figured out how to get Mozilla into git
From: Jakub Narebski @ 2006-06-09 17:10 UTC
To: git

Nicolas Pitre wrote:

> What was that 10GB related to, exactly? The original CVS repo, or the
> unpacked GIT repo?

Erm, the Subversion repository, the result of the cvs2svn conversion:

Jon Smirl> I wonder how long it will take to start gitk on a 10GB
Jon Smirl> repository.

(in the first post in this thread).

> Otherwise it should be quite easy to add a batch size argument to
> git-repack so that git-rev-list and git-pack-objects are called
> multiple times with sequential commit ranges, creating a repo with
> multiple packs.

Good idea. In addition to the best pack size being limited by the
32-bit address space and/or the RAM plus swap size, there are (rare)
limits on maximum file size on some filesystems, e.g. FAT28^W FAT32.

-- 
Jakub Narebski
Warsaw, Poland

* Re: Figured out how to get Mozilla into git
From: Jon Smirl @ 2006-06-09 18:13 UTC
To: Martin Langhoff; +Cc: git

On 6/8/06, Martin Langhoff <martin.langhoff@gmail.com> wrote:
> mozilla.git$ du -sh .git/
> 2.0G    .git/

That looks too small. My svn-to-git import is 2.7GB and the source CVS
is 3.0GB, and the svn import wasn't even finished when I stopped it.

My cvsps process is still running from last night. The error file is
341MB; how big is it when the conversion is finished? My machine is
swapping to death.

I'm still attracted to the cvs2svn tool. It handled everything right
the first time, it only needs 100MB to run, and it is also a lot
faster: cvsps and parsecvs both need gigabytes of RAM to run. I'll look
at cvs2svn some more, but I still need to figure out more about
low-level git and learn Python.

-- 
Jon Smirl
jonsmirl@gmail.com

* Re: Figured out how to get Mozilla into git
From: Linus Torvalds @ 2006-06-09 19:00 UTC
To: Jon Smirl; +Cc: Martin Langhoff, git

On Fri, 9 Jun 2006, Jon Smirl wrote:
>
> That looks too small. My svn-to-git import is 2.7GB and the source
> CVS is 3.0GB, and the svn import wasn't even finished when I stopped
> it.

Git is much better at packing than either CVS or SVN. Get used to it
;)

> My cvsps process is still running from last night. The error file is
> 341MB; how big is it when the conversion is finished? My machine is
> swapping to death.

Do you have all the cvsps patches? There are a few important ones
floating around, and David Mansfield never did a 2.2 release.. I'm
pretty sure Martin doesn't run plain 2.1.

Linus

* Re: Figured out how to get Mozilla into git
From: Jon Smirl @ 2006-06-09 20:17 UTC
To: Linus Torvalds; +Cc: Martin Langhoff, git

On 6/9/06, Linus Torvalds <torvalds@osdl.org> wrote:
>
> On Fri, 9 Jun 2006, Jon Smirl wrote:
> >
> > That looks too small. My svn-to-git import is 2.7GB and the source
> > CVS is 3.0GB, and the svn import wasn't even finished when I
> > stopped it.
>
> Git is much better at packing than either CVS or SVN. Get used to it
> ;)

The git tree that Martin got from cvsps is much smaller than the git
tree I got from going to svn and then to git. I don't know why the
trees are 700MB apart; it may be different amounts of packing, or one
of the conversion tools is losing something. Earlier he said:

> git-repack -a -d, but it OOMs on my 2GB+2GBswap machine :(

> > My cvsps process is still running from last night. The error file
> > is 341MB; how big is it when the conversion is finished? My machine
> > is swapping to death.
>
> Do you have all the cvsps patches? There are a few important ones
> floating around, and David Mansfield never did a 2.2 release..

I am running cvsps-2.1-3.fc5, so I may be wasting my time. The error
output is 535MB now. He sent me some git patches, but none for cvsps.

> I'm pretty sure Martin doesn't run plain 2.1.

I haven't come up with anything that is likely to result in Mozilla
switching over to git. Right now it takes three days to convert the
tree, and the tree will have to be run in parallel for a while to
convince everyone to switch. I don't have a solution for keeping it in
sync in near real time (commits would still go to CVS). Most Mozilla
developers are interested, but the infrastructure needs some help.

Martin has also brought up the problem of needing a partial clone so
that everyone doesn't have to bring down the entire repository. A trunk
checkout is 340MB, and Martin's git tree is 2GB (mine 2.7GB). A kernel
tree is only 680M.

-- 
Jon Smirl
jonsmirl@gmail.com

* Re: Figured out how to get Mozilla into git
From: Linus Torvalds @ 2006-06-09 20:40 UTC
To: Jon Smirl; +Cc: Martin Langhoff, git

On Fri, 9 Jun 2006, Jon Smirl wrote:
>
> > Git is much better at packing than either CVS or SVN. Get used to
> > it ;)
>
> The git tree that Martin got from cvsps is much smaller than the git
> tree I got from going to svn and then to git. I don't know why the
> trees are 700MB apart; it may be different amounts of packing, or one
> of the conversion tools is losing something.

.. or one of them is adding something. For example, it may well be that
cvs2svn does a lot more commits or something like that.

That said, I don't even see where git-svn packs anything at all, and
you're absolutely right that when/how you repack can make a huge
difference to disk usage, much more so than any importer details.

> > Do you have all the cvsps patches? There are a few important ones
> > floating around, and David Mansfield never did a 2.2 release..
>
> I am running cvsps-2.1-3.fc5, so I may be wasting my time. The error
> output is 535MB now. He sent me some git patches, but none for cvsps.

I've got a couple, but I was hoping David would do a cvsps-2.2. I have
this dim memory of him saying he had done some other improvements too.

> I haven't come up with anything that is likely to result in Mozilla
> switching over to git. Right now it takes three days to convert the
> tree, and the tree will have to be run in parallel for a while to
> convince everyone to switch. I don't have a solution for keeping it
> in sync in near real time (commits would still go to CVS). Most
> Mozilla developers are interested, but the infrastructure needs some
> help.

Sure. That said, I pretty much guarantee that the size issues will be
much, much worse for any other distributed SCM.

If Mozilla doesn't need the distributed thing, then SVN is probably the
best choice. It's still a total piece of crap, but hey, if crap (==
centralized) is what people are used to, a few billion flies can't be
wrong ;)

If you've got your import done, is there some place I can rsync it
from, so at least I can make sure that everything works fine for a repo
that size.. One day the Mozilla people will notice that they really
_really_ want the distribution, and they'll figure out quickly enough
that SVK doesn't cut it, I suspect.

Linus

* Re: Figured out how to get Mozilla into git
From: Jon Smirl @ 2006-06-09 20:56 UTC
To: Linus Torvalds; +Cc: Martin Langhoff, git

On 6/9/06, Linus Torvalds <torvalds@osdl.org> wrote:
> > I haven't come up with anything that is likely to result in Mozilla
> > switching over to git. Right now it takes three days to convert the
> > tree, and the tree will have to be run in parallel for a while to
> > convince everyone to switch. I don't have a solution for keeping it
> > in sync in near real time (commits would still go to CVS). Most
> > Mozilla developers are interested, but the infrastructure needs
> > some help.

They need the distributed thing whether they realize it or not. Some of
the external projects, like Songbird and Nvu, are vulnerable to drift
since they are running their own repositories. Once a few move/renames
happen they can't easily stay in sync anymore; it has been over a year
since Nvu was merged back into the trunk.

That is the same reason I want it: so that I can work on stuff locally
and have a repository. The core staff don't have this problem because
they can make all the branches they want in the main repository.

> If you've got your import done, is there some place I can rsync it
> from, so at least I can make sure that everything works fine for a
> repo that size.. One day the Mozilla people will notice that they
> really _really_ want the distribution, and they'll figure out quickly
> enough that SVK doesn't cut it, I suspect.

It would be better to rsync Martin's copy; he has a lot more bandwidth,
and it would take over a day to copy it off my cable modem. I'm signed
up to get FiOS as soon as they turn it on in my neighborhood - it's
already wired on the poles.

-- 
Jon Smirl
jonsmirl@gmail.com

* Re: Figured out how to get Mozilla into git
From: Linus Torvalds @ 2006-06-09 21:57 UTC
To: Jon Smirl; +Cc: Martin Langhoff, git

On Fri, 9 Jun 2006, Jon Smirl wrote:
>
> They need the distributed thing whether they realize it or not. Some
> of the external projects, like Songbird and Nvu, are vulnerable to
> drift since they are running their own repositories. Once a few
> move/renames happen they can't easily stay in sync anymore; it has
> been over a year since Nvu was merged back into the trunk.
>
> That is the same reason I want it: so that I can work on stuff
> locally and have a repository. The core staff don't have this problem
> because they can make all the branches they want in the main
> repository.

Yes. Anyway, I think we'll get git working well for repositories that
size, and eventually the core developers will notice how much better it
is.

In the meantime, the fact that git-cvsimport can be done incrementally
means that once we have the silly pack-file-mapping details worked out,
it should be perfectly fine to run the 3-day import just once, and then
work on it incrementally afterwards without any real problems.

So people like you who want to work on it off-line using a distributed
system _can_ do so, realistically. Maybe not practically _today_, but I
don't think the git issues are serious enough that we'd be talking
about "months from now"; it's more "in a week or so we might have
something that works fine for your case".

[ They had this long discussion about languages on #monotone the other
  day, and the reason I'll take C over anything else any day is the
  fact that a well-written C program is literally only limited by
  hardware, never by the language. The poor python/perl guys may write
  things more quickly, but when they hit a language wall, they hit it.

  I think we've got an excellent data model, and handling even
  something huge like the _whole_ history of Mozilla doesn't look very
  daunting at all. I just want to have a real test-case to motivate me
  to look at the problems. ]

> It would be better to rsync Martin's copy; he has a lot more
> bandwidth, and it would take over a day to copy it off my cable
> modem. I'm signed up to get FiOS as soon as they turn it on in my
> neighborhood - it's already wired on the poles.

Sure. I actually just have regular 128kbps DSL myself. I guess I should
upgrade to 256 (the downside of having deer munching on the roses in
our back yard is that I don't think I even have the option for anything
faster), but I'm so damn well distributed that the slow 128kbps is
actually more than enough - everything serious I do is local anyway.

So it will take me quite some time to download 2GB+, regardless of how
fat a pipe the other end has ;)

Linus

* Re: Figured out how to get Mozilla into git
From: Linus Torvalds @ 2006-06-09 22:17 UTC
To: Jon Smirl; +Cc: Martin Langhoff, git

On Fri, 9 Jun 2006, Linus Torvalds wrote:
>
> Sure. I actually just have regular 128kbps DSL myself.

Not bits, bytes. 128_KB_/s, of course. Actually, it's slightly more -
something like 146KB/s, which I guess comes to 1.5Mbps. Just in case
somebody thought I was living in a cave in the middle ages.

Anyway, no nice 5Mbps cable for me.

Linus

* Re: Figured out how to get Mozilla into git
From: Greg KH @ 2006-06-09 23:16 UTC
To: Linus Torvalds; +Cc: Jon Smirl, Martin Langhoff, git

On Fri, Jun 09, 2006 at 02:57:58PM -0700, Linus Torvalds wrote:
> > It would be better to rsync Martin's copy; he has a lot more
> > bandwidth, and it would take over a day to copy it off my cable
> > modem. I'm signed up to get FiOS as soon as they turn it on in my
> > neighborhood - it's already wired on the poles.
>
> So it will take me quite some time to download 2GB+, regardless of
> how fat a pipe the other end has ;)

Fed-Ex'ing a DVD or two would probably be fastest :)

* Re: Figured out how to get Mozilla into git
From: Martin Langhoff @ 2006-06-09 23:37 UTC
To: Linus Torvalds; +Cc: Jon Smirl, git

Apologies, I dropped out of the conversation -- Friday night drinks (NZ
timezone) took over ;-) Now, back on track...

On 6/10/06, Linus Torvalds <torvalds@osdl.org> wrote:
> In the meantime, the fact that git-cvsimport can be done
> incrementally means that once we have the silly pack-file-mapping
> details worked out, it should be perfectly fine to run the 3-day
> import just once, and then work on it incrementally afterwards
> without any real problems.

Exactly. The dog at this time is cvsps -- I also remember vague
promises from a list regular about publishing a git repo of cvsps 2.1
plus some patches from the list.

In any case, and for the record, my cvsps is 2.1 pristine. It handles
the mozilla repo alright, as long as I give it a lot of RAM. I _think_
it slurped 3GB with the mozilla cvs.

I want to review that cvs2svn importer, probably to steal the test
cases and perhaps some logic to revamp/replace cvsps. The thing is --
we can't just drop/replace cvsimport, because it does incrementals, so
continuity and consistency are key. All the CVS importers have to take
some hard decisions when the data is bad -- however it is we fudge it,
we kind of want to fudge it consistently ;-)

> So people like you who want to work on it off-line using a
> distributed system _can_ do so, realistically. Maybe not practically
> _today_

Other than "don't run repack -a", it's feasible. In fact, that's how I
use git 99% of the time -- to do DSCM stuff on projects that are using
CVS, like Moodle.

> The poor python/perl guys may write things more quickly, but when
> they hit a language wall, they hit it.

Flamebait, anyone? ;-) It is a different kind of fun -- let's say that
on top of knowing the performance tricks (or, to be more hip: "design
patterns") for the hardware and OS, you also end up learning the
performance tricks of the interpreter/vm/whatever.

> > It would be better to rsync Martin's copy; he has a lot more
> > bandwidth

I'm coming down to the office now to pick up my laptop, and I'll rsync
it out to our git machine (also the NZ kernel mirror, so bandwidth
should be good). That's one of the things I've discovered with these
large trees: for the initial publish action, I just use rsync or scp.
Perhaps I'm doing it wrong, but git-push doesn't optimise the
"initialise repo" case, and it takes ages (and in this case, it'd
probably OOM).

> So it will take me quite some time to download 2GB+, regardless of
> how fat a pipe the other end has ;)

Right-o. Linus, Jon, can you guys ping me when you have cloned it
safely so I can take it down again?

cheers,

martin

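The rsync-based initial publish he describes is just a verbatim copy of
the object store, so nothing on the sending side has to enumerate or
repack objects. A sketch, with a made-up host and paths:

	# One-time publish of a local repository; the host and both
	# paths are illustrative placeholders.
	rsync -a --progress mozilla.git/.git/ \
	    git.example.org:/pub/mozilla.git/
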
* Re: Figured out how to get Mozilla into git
From: Linus Torvalds @ 2006-06-09 23:43 UTC
To: Martin Langhoff; +Cc: Jon Smirl, git

On Sat, 10 Jun 2006, Martin Langhoff wrote:
>
> Exactly. The dog at this time is cvsps -- I also remember vague
> promises from a list regular about publishing a git repo of cvsps 2.1
> plus some patches from the list.

Ahh. cvsps doesn't do anything incrementally, does it? Although it
_does_ build up a cache of sorts, I think. That's not the part I
actually ever ended up looking at.

But yeah, a cvsps that blows up to a gig of VM and takes half an hour
to parse things just for an incremental update would be a problem.

> In any case, and for the record, my cvsps is 2.1 pristine. It handles
> the mozilla repo alright, as long as I give it a lot of RAM. I
> _think_ it slurped 3GB with the mozilla cvs.

Oh, wow. Every single repo I've seen ends up having tons of complaints
from pristine cvsps, but maybe that's because I only end up looking at
the ones with problems ;)

> I'm coming down to the office now to pick up my laptop, and I'll
> rsync it out to our git machine (also the NZ kernel mirror, so
> bandwidth should be good). That's one of the things I've discovered
> with these large trees: for the initial publish action, I just use
> rsync or scp. Perhaps I'm doing it wrong, but git-push doesn't
> optimise the "initialise repo" case, and it takes ages (and in this
> case, it'd probably OOM).
>
> > So it will take me quite some time to download 2GB+, regardless of
> > how fat a pipe the other end has ;)
>
> Right-o. Linus, Jon, can you guys ping me when you have cloned it
> safely so I can take it down again?

Tell me where/when it is, and I'll start slurping. Will let you know
when I'm done.

Linus

* Re: Figured out how to get Mozilla into git
From: Jon Smirl @ 2006-06-10  0:00 UTC
To: Linus Torvalds; +Cc: Martin Langhoff, git

On 6/9/06, Linus Torvalds <torvalds@osdl.org> wrote:
> On Sat, 10 Jun 2006, Martin Langhoff wrote:
> > In any case, and for the record, my cvsps is 2.1 pristine. It
> > handles the mozilla repo alright, as long as I give it a lot of
> > RAM. I _think_ it slurped 3GB with the mozilla cvs.
>
> Oh, wow. Every single repo I've seen ends up having tons of
> complaints from pristine cvsps, but maybe that's because I only end
> up looking at the ones with problems ;)

Are we sure cvsps is OK? It is generating 500MB of warnings when I run
it.

I have cvsps running at Dreamhost currently. I had to modify cvs,
cvsps, git, etc. to not respond to signals, to keep them from killing
everything. I can clone the 2GB git tree there. Let me know when it is
up.

-- 
Jon Smirl
jonsmirl@gmail.com

* Re: Figured out how to get Mozilla into git
From: Linus Torvalds @ 2006-06-10  0:11 UTC
To: Jon Smirl; +Cc: Martin Langhoff, git

On Fri, 9 Jun 2006, Jon Smirl wrote:
>
> Are we sure cvsps is OK? It is generating 500MB of warnings when I
> run it.

Do they go away with these patches?

		Linus

---
commit 3d1ebcef6b4f9f6c9064efd64da4dd30d93c3c96
Author: Linus Torvalds <torvalds@g5.osdl.org>
Date:   Wed Mar 22 17:20:20 2006 -0800

    Fix branch ancestor calculation

    Not having any ancestor at all means that any valid ancestor (even
    of "depth 0") is fine.

    Signed-off-by: Linus Torvalds <torvalds@osdl.org>

diff --git a/cvsps.c b/cvsps.c
index c22147e..2695a0f 100644
--- a/cvsps.c
+++ b/cvsps.c
@@ -2599,7 +2599,7 @@ static void determine_branch_ancestor(Pa
 	 * note: rev is the pre-commit revision, not the post-commit
 	 */
 	if (!head_ps->ancestor_branch)
-	    d1 = 0;
+	    d1 = -1;
 	else if (strcmp(ps->branch, rev->branch) == 0)
 	    continue;
 	else if (strcmp(head_ps->ancestor_branch, "HEAD") == 0)

commit 82fcf7e31bbeae3b01a8656549e9b8fd89d598eb
Author: Linus Torvalds <torvalds@g5.osdl.org>
Date:   Wed Mar 22 11:23:37 2006 -0800

    Improve handling of file collisions in the same patchset

    Take the file revision into account.

diff --git a/cvsps.c b/cvsps.c
index 1e64e3c..c22147e 100644
--- a/cvsps.c
+++ b/cvsps.c
@@ -2384,8 +2384,31 @@ void patch_set_add_member(PatchSet * ps,
     for (next = ps->members.next; next != &ps->members; next = next->next) {
 	PatchSetMember * m = list_entry(next, PatchSetMember, link);
 
-	if (m->file == psm->file && ps->collision_link.next == NULL)
-	    list_add(&ps->collision_link, &collisions);
+	if (m->file == psm->file) {
+	    int order = compare_rev_strings(psm->post_rev->rev, m->post_rev->rev);
+
+	    /*
+	     * Same revision too? Add it to the collision list
+	     * if it isn't already.
+	     */
+	    if (!order) {
+		if (ps->collision_link.next == NULL)
+		    list_add(&ps->collision_link, &collisions);
+		return;
+	    }
+
+	    /*
+	     * If this is an older revision than the one we already have
+	     * in this patchset, just ignore it
+	     */
+	    if (order < 0)
+		return;
+
+	    /*
+	     * This is a newer one, remove the old one
+	     */
+	    list_del(&m->link);
+	}
     }
 
     psm->ps = ps;

commit 534120d9a47062eecd7b53fd7ac0b70d97feb4fd
Author: Linus Torvalds <torvalds@g5.osdl.org>
Date:   Wed Mar 22 11:20:59 2006 -0800

    Increase log-length limit to 64kB

    Yeah, it should be dynamic. I'm lazy.

diff --git a/cvsps_types.h b/cvsps_types.h
index b41e2a9..dba145d 100644
--- a/cvsps_types.h
+++ b/cvsps_types.h
@@ -8,7 +8,7 @@ #define CVSPS_TYPES_H
 
 #include <time.h>
 
-#define LOG_STR_MAX 32768
+#define LOG_STR_MAX 65536
 #define AUTH_STR_MAX 64
 #define REV_STR_MAX 64
 #define MIN(a, b) ((a) < (b) ? (a) : (b))

* Re: Figured out how to get Mozilla into git
From: Jon Smirl @ 2006-06-10  0:16 UTC
To: Linus Torvalds; +Cc: Martin Langhoff, git

I'll apply them and give it a test. Most of the warnings look like
this:

WARNING: Invalid PatchSet 151492, Tag JSS_4_0_RTM: security/coreconf/HP-UX.mk:1.8=after, security/jss/org/mozilla/jss/crypto/KeyPairAlgorithm.java:1.5=before. Treated as 'before'
WARNING: Invalid PatchSet 151492, Tag JSS_4_0_RTM: security/coreconf/HP-UX.mk:1.8=after, security/jss/org/mozilla/jss/crypto/KeyPairGenerator.java:1.5=before. Treated as 'before'
WARNING: Invalid PatchSet 151492, Tag JSS_4_0_RTM: security/coreconf/HP-UX.mk:1.8=after, security/jss/org/mozilla/jss/crypto/KeyPairGeneratorSpi.java:1.3=before. Treated as 'before'
WARNING: Invalid PatchSet 151492, Tag JSS_4_0_RTM: security/coreconf/HP-UX.mk:1.8=after, security/jss/org/mozilla/jss/crypto/KeyWrapAlgorithm.java:1.8=before. Treated as 'before'
WARNING: Invalid PatchSet 151492, Tag JSS_4_0_RTM: security/coreconf/HP-UX.mk:1.8=after, security/jss/org/mozilla/jss/crypto/KeyWrapper.java:1.8=before. Treated as 'before'
WARNING: Invalid PatchSet 151492, Tag JSS_4_0_RTM: security/coreconf/HP-UX.mk:1.8=after, security/jss/org/mozilla/jss/crypto/Makefile:1.2=before. Treated as 'before'
WARNING: Invalid PatchSet 151492, Tag JSS_4_0_RTM: security/coreconf/HP-UX.mk:1.8=after, security/jss/org/mozilla/jss/crypto/NoSuchItemOnTokenException.java:1.3=before. Treated as 'before'

-- 
Jon Smirl
jonsmirl@gmail.com

* Re: Figured out how to get Mozilla into git
From: Jon Smirl @ 2006-06-10  0:45 UTC
To: Linus Torvalds; +Cc: Martin Langhoff, git

They must be running some kind of process accounting at my host. As
soon as I hit 500MB of RAM I get killed immediately. It is not from a
signal - I'm catching all of those. I get this on the console:

[1]+  Killed  CVSROOT=~/jonsmirl.dreamhosters.com/mozilla/ cvsps -x --norc -A mozilla >mozilla.cvsps 2>mozilla.cvspserr

and nothing on stdout or stderr.

Kernel string: 2.4.29-grsec+w+fhs6b+gr0501+nfs+a32+++p4+sata+c4+gr2b-v6.189

-- 
Jon Smirl
jonsmirl@gmail.com

* Re: Figured out how to get Mozilla into git
From: Jakub Narebski @ 2006-06-09 20:44 UTC
To: git

Jon Smirl wrote:

> Martin has also brought up the problem of needing a partial clone so
> that everyone doesn't have to bring down the entire repository. A
> trunk checkout is 340MB, and Martin's git tree is 2GB (mine 2.7GB). A
> kernel tree is only 680M.

We don't have partial/shallow nor lazy clone yet (although there might
be some partial solutions for shallow clone in topic branches and/or in
patches flying around on the git mailing list).

But you can do what was done for the Linux kernel: split the repository
into a current and a historical one, and join them (join the history)
when needed using grafts. Even if one needs the historical repository,
it needs to be cloned/copied only _once_. And with alternates (using
the historical repository as one of the alternates for the current
repository), someone who has both repositories needs only a little more
space, I think, than with a single repository.

-- 
Jakub Narebski
Warsaw, Poland

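The kernel-style split-plus-grafts setup he describes looks roughly
like this; the path and both SHA1s are placeholders that depend on
where the history is cut:

	# In the current repo, borrow objects from the historical repo
	# (the path is a placeholder):
	echo /pub/mozilla-history.git/objects >> .git/objects/info/alternates

	# Graft the tip of the historical repo in as the parent of the
	# current repo's root commit (both SHA1s are placeholders):
	echo "<first-current-commit> <last-historic-commit>" \
	    >> .git/info/grafts
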
* Re: Figured out how to get Mozilla into git
From: Nicolas Pitre @ 2006-06-09 21:05 UTC
To: Jon Smirl; +Cc: Linus Torvalds, Martin Langhoff, git

On Fri, 9 Jun 2006, Jon Smirl wrote:

> I haven't come up with anything that is likely to result in Mozilla
> switching over to git. Right now it takes three days to convert the
> tree, and the tree will have to be run in parallel for a while to
> convince everyone to switch. I don't have a solution for keeping it
> in sync in near real time (commits would still go to CVS). Most
> Mozilla developers are interested, but the infrastructure needs some
> help.

This is true. GIT is still evolving and certainly needs work to cope
with environments and datasets that were never tested before. The
Mozilla repo is one of those, and we're certainly interested in making
it work well. GIT might not be right for it just yet, but if you could
let us rsync your converted repo to play with, that might help us work
on proper fixes for that kind of repo.

> Martin has also brought up the problem of needing a partial clone so
> that everyone doesn't have to bring down the entire repository.

If it can be repacked into a single pack, that size might get much
smaller too.

Nicolas

* Re: Figured out how to get Mozilla into git
From: Jon Smirl @ 2006-06-09 21:46 UTC
To: Nicolas Pitre; +Cc: Linus Torvalds, Martin Langhoff, git

On 6/9/06, Nicolas Pitre <nico@cam.org> wrote:
> On Fri, 9 Jun 2006, Jon Smirl wrote:
>
> > I haven't come up with anything that is likely to result in Mozilla
> > switching over to git. Right now it takes three days to convert the
> > tree, and the tree will have to be run in parallel for a while to
> > convince everyone to switch. I don't have a solution for keeping it
> > in sync in near real time (commits would still go to CVS). Most
> > Mozilla developers are interested, but the infrastructure needs
> > some help.
>
> This is true. GIT is still evolving and certainly needs work to cope
> with environments and datasets that were never tested before. The
> Mozilla repo is one of those, and we're certainly interested in
> making it work well. GIT might not be right for it just yet, but if
> you could let us rsync your converted repo to play with, that might
> help us work on proper fixes for that kind of repo.

I'm rebuilding it on my shared hosting account at dreamhost.com. I'll
see if I can get it built before they notice and kill my process. My
account there is on a 4GB quad Xeon box, so hopefully it can convert
the tree faster.

My account has 1TB of download per month, so rsync will be OK - not bad
for $12 the first year. It would take over a day to rsync it off my
home machine.

> > Martin has also brought up the problem of needing a partial clone
> > so that everyone doesn't have to bring down the entire repository.
>
> If it can be repacked into a single pack, that size might get much
> smaller too.

-- 
Jon Smirl
jonsmirl@gmail.com

* Re: Figured out how to get Mozilla into git
From: Martin Langhoff @ 2006-06-10  1:23 UTC
To: Jon Smirl; +Cc: Linus Torvalds, git

On 6/10/06, Jon Smirl <jonsmirl@gmail.com> wrote:
> The git tree that Martin got from cvsps is much smaller than the git
> tree I got from going to svn and then to git. I don't know why the
> trees are 700MB apart; it may be different amounts of packing, or one
> of the conversion tools is losing something.

Don't read too much into that. Packing/repacking points make a _huge_
difference, and even if one of our trees is a bit corrupt, the pack
sizes should be about the same.

(With the patches I sent you we _are_ choosing to ignore a few branches
that don't seem to make sense in the cvsps output. These will show up
in the error output -- what I saw there were very old, possibly corrupt
branches, stuff I wouldn't shed a tear over, but it is worth
reviewing.)

> I haven't come up with anything that is likely to result in Mozilla
> switching over to git. Right now it takes three days to convert the
> tree, and the tree will have to be run in parallel for a while to
> convince everyone to switch. I don't have a solution for keeping it
> in sync in near real time (commits would still go to CVS). Most
> Mozilla developers are interested, but the infrastructure needs some
> help.

Don't worry about the initial import time. Once you've done it, you can
run the incremental import (which will take a few minutes) even hourly
to keep "in sync".

> Martin has also brought up the problem of needing a partial clone so
> that everyone doesn't have to bring down the entire repository. A
> trunk checkout is 340MB, and Martin's git tree is 2GB (mine 2.7GB). A
> kernel tree is only 680M.

Now that I have managed to repack the repo, it is indeed back in the
600MB range. Actually, I just re-repacked; it took under a minute, and
it shrank down to 607MB. Yay. I'm sure that if you git-repack -a -d on
a machine with plenty of memory once or twice, we'll have matching
packs.

cheers,

martin

* Re: Figured out how to get Mozilla into git
From: Martin Langhoff @ 2006-06-10  1:14 UTC
To: Jon Smirl; +Cc: git

On 6/9/06, Martin Langhoff <martin.langhoff@gmail.com> wrote:
> mozilla.git$ du -sh .git/
> 2.0G    .git/

OK -- I pushed the repository out to our mirror box. Try:

	git-clone http://mirrors.catalyst.net.nz/pub/mozilla.git/

Now, good news. No, _very_ good news. As I was rsync'ing this out and
looking at the repo, suddenly something seemed odd. Apparently, after
git-repack -a -d OOM'd on me and I had posted that message, I re-ran
it. [As it happens, I have been running several imports of Gentoo and
Mozilla on the box lately; it is entirely possible that cvsps or a
stray git-cvsimport was sitting on a whole lot of RAM at the time.]

Now, I don't know how much memory or time it took, but it clearly
completed OK. And it's now a single pack, weighing a grand total of
617MB.

So my comments about OOM'ing were apparently wrong. Hey, if the whole
history is actually only 617MB, then initial checkouts are back to
something reasonable, I'd say.

cheers,

martin

* Re: Figured out how to get Mozilla into git
From: Linus Torvalds @ 2006-06-10  1:33 UTC
To: Martin Langhoff; +Cc: Jon Smirl, git

On Sat, 10 Jun 2006, Martin Langhoff wrote:
>
> Now, I don't know how much memory or time it took, but it clearly
> completed OK. And it's now a single pack, weighing a grand total of
> 617MB.

Ok, that's more than reasonable. That should be fairly easily mapped on
a 32-bit architecture without any huge problems, even with some VM
fragmentation going on. It might be borderline (and you definitely want
a 3:1 user:kernel VM split), but considering that the original CVS
archive was apparently 3GB, having a single 617MB pack-file is still
pretty damn good.

That's like 20% of the original, with all the obvious distribution
advantages.

Clearly this whole thing _does_ show that we could improve the process
of importing things from CVS a whole lot, and I assume your 617MB pack
doesn't have the nice name/email translations so it needs to be fixed
up, but it sounds like on the whole the core git design came through
with shining colors, even if we may want to polish things up a bit ;)

I'm downloading the thing right now.

Linus

* Re: Figured out how to get Mozilla into git
From: Linus Torvalds @ 2006-06-10  1:43 UTC
To: Martin Langhoff; +Cc: Jon Smirl, git

On Fri, 9 Jun 2006, Linus Torvalds wrote:
>
> That's like 20% of the original, with all the obvious distribution
> advantages.

Btw, does anybody know roughly how much data an initial "cvs co" takes
on the mozilla repo? Git will obviously get the whole history, and that
will inevitably be bigger than getting a single check-out, but it's not
necessarily orders of magnitude bigger.

It could be that getting the whole git archive is not _that_ much more
expensive than getting a single version, considering how well history
compresses (e.g. the kernel git archive isn't orders of magnitude
bigger than a single compressed tar-ball of the sources).

At that point, it's probably a pretty usable alternative.

(Although, to be fair, we almost certainly have to improve
"git-rev-list --objects --all" performance on that thing, since
otherwise that's going to make it totally impossible to do initial
clones using the native git protocol, and make git look bad.)

Linus

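The cost he is worried about is easy to gauge on any repository, since
a full clone has to enumerate every reachable object before a single
byte of pack data is sent. A rough measurement, nothing more:

	# Time the enumeration of every object reachable from any ref;
	# on a Mozilla-sized repo this step alone is expensive.
	time git-rev-list --objects --all | wc -l
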
* Re: Figured out how to get Mozilla into git
From: Jon Smirl @ 2006-06-10  1:48 UTC
To: Linus Torvalds; +Cc: Martin Langhoff, git

On 6/9/06, Linus Torvalds <torvalds@osdl.org> wrote:
>
> Btw, does anybody know roughly how much data an initial "cvs co"
> takes on the mozilla repo? Git will obviously get the whole history,
> and that will inevitably be bigger than getting a single check-out,
> but it's not necessarily orders of magnitude bigger.

339MB for the initial checkout.

> It could be that getting the whole git archive is not _that_ much
> more expensive than getting a single version, considering how well
> history compresses (e.g. the kernel git archive isn't orders of
> magnitude bigger than a single compressed tar-ball of the sources).
>
> At that point, it's probably a pretty usable alternative.
>
> (Although, to be fair, we almost certainly have to improve
> "git-rev-list --objects --all" performance on that thing, since
> otherwise that's going to make it totally impossible to do initial
> clones using the native git protocol, and make git look bad.)

-- 
Jon Smirl
jonsmirl@gmail.com

* Re: Figured out how to get Mozilla into git
From: Linus Torvalds @ 2006-06-10  1:59 UTC
To: Jon Smirl; +Cc: Martin Langhoff, git

On Fri, 9 Jun 2006, Jon Smirl wrote:
> >
> > Btw, does anybody know roughly how much data an initial "cvs co"
> > takes on the mozilla repo? Git will obviously get the whole
> > history, and that will inevitably be bigger than getting a single
> > check-out, but it's not necessarily orders of magnitude bigger.
>
> 339MB for the initial checkout.

And I think people run :pserver: with compression by default, so we're
likely talking about half that in actual download overhead, no?

So a git clone would be about (wild handwaving, don't look at all the
assumptions) four times as expensive as an initial CVS co - roughly
617MB of pack versus perhaps 170MB of compressed checkout - if we only
count a poor DSL line as the expense, but you'd get the _whole_
history. Which may or may not make up for it: for some people it will,
for others it won't.

Of course, to make up for some of the initial cost, I suspect that for
people who are used to "cvs update" taking 15 minutes to update two
files, it would be a serious relief to see git's "300 objects in five
seconds" kind of pull.

Although I guess that's one of the CVS things that SVN improved on. At
least I'd hope so ;/

Linus

* Re: Figured out how to get Mozilla into git 2006-06-10 1:59 ` Linus Torvalds @ 2006-06-10 2:21 ` Jon Smirl 2006-06-10 2:34 ` Carl Worth 2006-06-10 3:01 ` Linus Torvalds 2006-06-10 2:30 ` Jon Smirl 2006-06-10 3:41 ` Martin Langhoff 2 siblings, 2 replies; 67+ messages in thread From: Jon Smirl @ 2006-06-10 2:21 UTC (permalink / raw) To: Linus Torvalds; +Cc: Martin Langhoff, git On 6/9/06, Linus Torvalds <torvalds@osdl.org> wrote: > > > On Fri, 9 Jun 2006, Jon Smirl wrote: > > > > > > Btw, does anybody know roughly how much data an initial "cvs co" takes on > > > the mozilla repo? Git will obviously get the whole history, and that will > > > inevitably be bigger than getting a single check-out, but it's not > > > necessarily orders of magnitude bigger. > > > > 339MB for initial checkout > > And I think people run :pserver: with compression by default, so we're > likely talking about half that in actual download overhead, no? > > So a git clone would be about (wild handwaving, don't look at all the > assumptions) four times as expensive - assuming we only look at a poor DSL > line as the expense - as an initial CVS co, but you'd get the _whole_ > history. Which may or may not make up for it. For some people it will, for > others it won't. Could you clone the repo and delete changesets earlier than 2004? Then I would clone the small repo and work with it. Later I decide I want full history, can I pull from a full repository at that point and get updated? That would need a flag to trigger it since I don't want full history to come over if I am just getting updates from someone else's tree that has a full history. > > Of course, to make up for some of the initial costs, I suspect that for some > people who are used to "cvs update" taking 15 minutes to update two files, > it would be a serious relief to see the git kind of "300 objects in five > seconds" kinds of pulls. No more cvs diff taking four minutes to finish. I have to do that every time I want to generate a 10 line patch. Diffs can run locally. No more cvs update to replace files I deleted because I messed up edits in them. And I can have local branches, yeah! What are we going to do about the BeOS developers on Mozilla? There are a couple more obscure OSes. > Although I guess that's one of the CVS things that SVN improved on. At > least I'd hope so ;/ > > Linus > -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 67+ messages in thread
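A minimal sketch of the local workflow Jon is looking forward to; the branch name and commit message here are made up for illustration, and the commands are the dashed forms in use at the time:

    git-checkout -b quickfix master   # a purely local branch
    # ... edit files ...
    git-diff                          # runs locally, no server round-trip
    git-commit -a -m "10 line patch"
    git-diff master quickfix          # the patch to mail out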
* Re: Figured out how to get Mozilla into git 2006-06-10 2:21 ` Jon Smirl @ 2006-06-10 2:34 ` Carl Worth 2006-06-10 3:08 ` Linus Torvalds 2006-06-10 3:01 ` Linus Torvalds 1 sibling, 1 reply; 67+ messages in thread From: Carl Worth @ 2006-06-10 2:34 UTC (permalink / raw) To: Jon Smirl; +Cc: Linus Torvalds, Martin Langhoff, git [-- Attachment #1: Type: text/plain, Size: 1121 bytes --] On Fri, 9 Jun 2006 22:21:17 -0400, "Jon Smirl" wrote: > > Could you clone the repo and delete changesets earlier than 2004? Then > I would clone the small repo and work with it. Later I decide I want > full history, can I pull from a full repository at that point and get > updated? That would need a flag to trigger it since I don't want full > history to come over if I am just getting updates from someone else's > tree that has a full history. This is clearly a desirable feature, and has been requested by several people (including myself) looking to switch some large-ish histories from an existing system to git. If you'd like to look through git archives for some discussion of the issues that would be involved here, look for "shallow clone". There's a related proposal termed "lazy clone" for one that would pull down missing objects as needed over the network. My impression is that both things will eventually be implemented. There's certainly nothing fundamental in git that will prevent them (though there will be some interesting things to resolve as a real patch for this stuff is explored). -Carl [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: Figured out how to get Mozilla into git 2006-06-10 2:34 ` Carl Worth @ 2006-06-10 3:08 ` Linus Torvalds 2006-06-10 8:21 ` Jakub Narebski 2006-06-10 8:36 ` Rogan Dawes 0 siblings, 2 replies; 67+ messages in thread From: Linus Torvalds @ 2006-06-10 3:08 UTC (permalink / raw) To: Carl Worth; +Cc: Jon Smirl, Martin Langhoff, git On Fri, 9 Jun 2006, Carl Worth wrote: > On Fri, 9 Jun 2006 22:21:17 -0400, "Jon Smirl" wrote: > > > > Could you clone the repo and delete changesets earlier than 2004? Then > > I would clone the small repo and work with it. Later I decide I want > > full history, can I pull from a full repository at that point and get > > updated? That would need a flag to trigger it since I don't want full > > history to come over if I am just getting updates from someone else's > > tree that has a full history. > > This is clearly a desirable feature, and has been requested by several > people (including myself) looking to switch some large-ish histories > from an existing system to git. The thing is, to some degree it's really fundamentally hard. It's easy for a linear history. What you do for a linear history is to just get the top commit, and the tree associated with it, and then you cauterize the parent by just grafting it to go away. Boom. You're done. The problems are that if the preceding history _wasn't_ linear (or, in fact, _subsequent_ development refers to it by having branched off at an earlier point), and you try to pull your updates, the other end (that knows about all the history) will assume you have all the history that you don't have, and will send you a pack assuming that. Which won't even necessarily have all the tree/blob objects (it assumed you already had them), but more annoyingly, the history won't be cauterized, and you'll have dangling commits. Which you can cauterize by hand, of course, but you literally _will_ have to get the objects and cauterize the thing by hand. You're right that it's not "fundamentally impossible" to do: the git format certainly _allows_ it. But the git protocol handshake really does end up optimizing away all the unnecessary work by knowing that the other side will have all the shared history, so lacking the shared history will mean that you're a bit screwed. Using the http protocol actually works. It doesn't do any handshake: it will just fetch objects from the other end as it needs them. The downside, of course, is that it also doesn't understand packs, so if the source is packed (and it pretty much _will_ be, for any big source), you're going to end up getting it all _anyway_. Linus ^ permalink raw reply [flat|nested] 67+ messages in thread
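The grafting Linus refers to can be sketched with the grafts file: each line of .git/info/grafts is a commit id followed by the parent ids git should pretend that commit has, so a line with no parents makes the commit look parentless. A minimal by-hand illustration of the cauterization trick (not a supported shallow-clone command):

    # Pretend the current tip has no parents, hiding all earlier history.
    # (This overwrites any existing grafts file.)
    git-rev-parse HEAD > .git/info/grafts
    git-rev-list HEAD | wc -l    # now reports a single commit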
* Re: Figured out how to get Mozilla into git 2006-06-10 3:08 ` Linus Torvalds @ 2006-06-10 8:21 ` Jakub Narebski 2006-06-10 9:00 ` Junio C Hamano 0 siblings, 1 reply; 67+ messages in thread From: Jakub Narebski @ 2006-06-10 8:21 UTC (permalink / raw) To: git Linus Torvalds wrote: > On Fri, 9 Jun 2006, Carl Worth wrote: > >> On Fri, 9 Jun 2006 22:21:17 -0400, "Jon Smirl" wrote: >> > >> > Could you clone the repo and delete changesets earlier than 2004? Then >> > I would clone the small repo and work with it. Later I decide I want >> > full history, can I pull from a full repository at that point and get >> > updated? That would need a flag to trigger it since I don't want full >> > history to come over if I am just getting updates from someone else's >> > tree that has a full history. >> >> This is clearly a desirable feature, and has been requested by several >> people (including myself) looking to switch some large-ish histories >> from an existing system to git. > > The thing is, to some degree it's really fundamentally hard. > > It's easy for a linear history. What you do for a linear history is to > just get the top commit, and the tree associated with it, and then you > cauterize the parent by just grafting it to go away. Boom. You're done. > > The problems are that if the preceding history _wasn't_ linear (or, in > fact, _subsequent_ development refers to it by having branched off at an > earlier point), and you try to pull your updates, the other end (that > knows about all the history) will assume you have all the history that you > don't have, and will send you a pack assuming that. Couldn't it be solved by enhancing the initial handshake to send from the puller (object receiver) to the pullee (object sender) the contents of the graft file, or better the contents of a cauterizing graft file? Without splitting the graft file we would need an option to send it or not, since a graft file can also be used to join a historical repository's line of development, not just to cauterize history. Then the sender would use the received cauterizing graft file _only_ for calculating which objects to send, cauterizing its own history "in memory". The main disadvantage is that if one cauterized history too eagerly, the shallow clone's history can lack merge bases, with no way to get them _simply_ using this approach... Now I guess you would tell me why this very simple idea is stupid... -- Jakub Narebski Warsaw, Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: Figured out how to get Mozilla into git 2006-06-10 8:21 ` Jakub Narebski @ 2006-06-10 9:00 ` Junio C Hamano 0 siblings, 0 replies; 67+ messages in thread From: Junio C Hamano @ 2006-06-10 9:00 UTC (permalink / raw) To: Jakub Narebski; +Cc: git Jakub Narebski <jnareb@gmail.com> writes: > Couldn't it be solved by enhancing the initial handshake to send from the > puller (object receiver) to the pullee (object sender) the contents of the > graft file, or better the contents of a cauterizing graft file? Without > splitting the graft file we would need an option to send it or not, since a > graft file can also be used to join a historical repository's line of > development, not just to cauterize history. > > Then the sender would use the received cauterizing graft file _only_ for > calculating which objects to send, cauterizing its own history "in memory". > > Now I guess you would tell me why this very simple idea is stupid... It is not stupid at all; what you said is actually on a correct track. You indeed just reinvented half of what I've outlined earlier for implementing shallow clone (the other half you missed is that the graft exchange needs to happen both ways, limiting the commit ancestry graph both ends walk to the intersection of the fake view of the ancestry graph both ends have, but that is a minor detail). The problem is that what Linus described as "fundamentally hard" is not the initial "shallow clone" stage, but lies elsewhere. Namely, what to do after you create such a shallow clone and when you want to unplug an earlier cauterization point. In order to unplug a cauterization point (a commit we faked to be parentless earlier, whose parents and associated objects we ought to have but we do not because we made a shallow clone), the downloader needs to re-fetch that commit while temporarily pretending that it does not have any objects that are newer, perhaps defining another earlier point as a new cauterization point at the same time. The git format allows for that, and the protocol exchange can certainly be extended to support something like that, but the design work would be quite involved. ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: Figured out how to get Mozilla into git 2006-06-10 3:08 ` Linus Torvalds 2006-06-10 8:21 ` Jakub Narebski @ 2006-06-10 8:36 ` Rogan Dawes 2006-06-10 9:08 ` Junio C Hamano 2006-06-10 17:53 ` Linus Torvalds 1 sibling, 2 replies; 67+ messages in thread From: Rogan Dawes @ 2006-06-10 8:36 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jon Smirl, Martin Langhoff, git Linus Torvalds wrote: > > On Fri, 9 Jun 2006, Carl Worth wrote: > >> On Fri, 9 Jun 2006 22:21:17 -0400, "Jon Smirl" wrote: >>> Could you clone the repo and delete changesets earlier than 2004? Then >>> I would clone the small repo and work with it. Later I decide I want >>> full history, can I pull from a full repository at that point and get >>> updated? That would need a flag to trigger it since I don't want full >>> history to come over if I am just getting updates from someone else's >>> tree that has a full history. >> This is clearly a desirable feature, and has been requested by several >> people (including myself) looking to switch some large-ish histories >> from an existing system to git. > > The thing is, to some degree it's really fundamentally hard. > > It's easy for a linear history. What you do for a linear history is to > just get the top commit, and the tree associated with it, and then you > cauterize the parent by just grafting it to go away. Boom. You're done. > > The problems are that if the preceding history _wasn't_ linear (or, in > fact, _subsequent_ development refers to it by having branched off at an > earlier point), and you try to pull your updates, the other end (that > knows about all the history) will assume you have all the history that you > don't have, and will send you a pack assuming that. > > Which won't even necessarily have all the tree/blob objects (it assumed > you already had them), but more annoyingly, the history won't be > cauterized, and you'll have dangling commits. Which you can cauterize by > hand, of course, but you literally _will_ have to get the objects and > cauterize the thing by hand. > > You're right that it's not "fundamentally impossible" to do: the git > format certainly _allows_ it. But the git protocol handshake really does > end up optimizing away all the unnecessary work by knowing that the other > side will have all the shared history, so lacking the shared history will > mean that you're a bit screwed. Here's an idea. How about separating trees and commits from the actual blobs (e.g. in separate packs)? My reasoning is that the commits and trees should only be a small portion of the overall repository size, and should not be that expensive to transfer. (Of course, this is only a guess, and needs some numbers to back it up.) So, a shallow clone would receive all of the tree objects, and all of the commit objects, and could then request a pack containing the blobs represented by the current HEAD. In this way, the user has a history that will show all of the commit messages, and would be able to see _which_ files have changed over time; e.g. gitk would still work (except for the actual file-level diff), "git log" should also still work, etc. This would also enable other optimisations. For example, documentation people would only need to get the objects under the doc/ tree, and would not need to actually check out the source. Git could detect any actual changes by checking whether it has the previous blob in its local repository, and whether the file exists locally. 
Creating a patch would obviously require that the person checks out the previous version, but one could theoretically commit a new blob to a repo without having the previous one (not saying that this would be a good idea, of course). This would probably require Eric Biederman's "direct access to blob" patches, I guess, in order to be feasible. Regards, Rogan ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: Figured out how to get Mozilla into git 2006-06-10 8:36 ` Rogan Dawes @ 2006-06-10 9:08 ` Junio C Hamano 2006-06-10 14:47 ` Rogan Dawes 2006-06-10 17:53 ` Linus Torvalds 1 sibling, 1 reply; 67+ messages in thread From: Junio C Hamano @ 2006-06-10 9:08 UTC (permalink / raw) To: Rogan Dawes; +Cc: git Rogan Dawes <lists@dawes.za.net> writes: > Here's an idea. How about separating trees and commits from the actual > blobs (e.g. in separate packs)? If I remember my numbers correctly, trees for any project with a size that matters contribute a nonnegligible amount of the total pack weight. Perhaps 10-25%. > In this way, the user has a history that will show all of the commit > messages, and would be able to see _which_ files have changed over > time; e.g. gitk would still work (except for the actual file-level > diff), "git log" should also still work, etc. I suspect it would make a very unpleasant system to use. Sometimes "git diff -p" would show diffs, and other times it would mysteriously complain that it lacks the necessary blobs to do its job. You cannot even run fsck and tell from its output which missing objects are OK (because you chose to create such a sparse repository) and which are real corruption. A shallow clone with explicit cauterization in the grafts file at least would not have that problem. Although the user will still not see the exact same result as what would happen in a full repository, at least we can say "your git log ends at that commit because your copy of the history does not go back beyond that" and the user would understand. ^ permalink raw reply [flat|nested] 67+ messages in thread
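For anyone wanting to check figures like that 10-25% on their own repository, a crude sketch (it sums uncompressed object sizes, so it only approximates the packed, deltified weight, and it is slow on a big repo):

    # Count objects and total uncompressed bytes per object type.
    git-rev-list --objects --all | cut -d' ' -f1 |
    while read sha; do
        echo "$(git-cat-file -t $sha) $(git-cat-file -s $sha)"
    done |
    awk '{ n[$1]++; b[$1] += $2 }
         END { for (t in n) print t, n[t], "objects,", b[t], "bytes" }'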
* Re: Figured out how to get Mozilla into git 2006-06-10 9:08 ` Junio C Hamano @ 2006-06-10 14:47 ` Rogan Dawes 2006-06-10 14:58 ` Jakub Narebski 2006-06-10 15:14 ` Nicolas Pitre 0 siblings, 2 replies; 67+ messages in thread From: Rogan Dawes @ 2006-06-10 14:47 UTC (permalink / raw) To: Junio C Hamano; +Cc: git Junio C Hamano wrote: > Rogan Dawes <lists@dawes.za.net> writes: > >> Here's an idea. How about separating trees and commits from the actual >> blobs (e.g. in separate packs)? > > If I remember my numbers correctly, trees for any project with a > size that matters contribute a nonnegligible amount of the total > pack weight. Perhaps 10-25%. Out of curiosity, do you think that it may be possible for tree objects to compress more/better if they are packed together? Or does the existing pack compression logic already do the diff against similar tree objects? >> In this way, the user has a history that will show all of the commit >> messages, and would be able to see _which_ files have changed over >> time; e.g. gitk would still work (except for the actual file-level >> diff), "git log" should also still work, etc. > > I suspect it would make a very unpleasant system to use. > Sometimes "git diff -p" would show diffs, and other times it would > mysteriously complain that it lacks the necessary blobs to do > its job. You cannot even run fsck and tell from its output > which missing objects are OK (because you chose to create such a > sparse repository) and which are real corruption. The fsck problem could be worked around by maintaining a list of objects that are explicitly not expected to be present. As diffs are performed, other parts of the blob history are retrieved, etc., the list will get shorter until we have a complete clone of the original tree. Of course diffs against a version further back in the history would fail. But if you start with a checkout of a complete tree, any changes made since that point would at least have one version to compare against. In effect, what we would have is a caching repository (or as Jakub said, a lazy clone). An initial checkout would effectively be pre-seeding the cache. One does not necessarily even need to get the complete set of commit and tree objects, either. The bare minimum would probably be to get the HEAD commit, and the tree objects that correspond to that commit. At that point, one could populate the "uncached objects" list with the parent commits. One would not be in a position to get any history at all, of course. As the user performs various operations, e.g. git log, git could either go and fetch the necessary objects (updating the uncached list as it goes), or fail with a message such as "Cannot perform the requested operation - required objects are not available". (We may require another utility that would list the objects required for an operation, and compare it against the list of "uncached objects", printing out a list of which are not yet available locally. I realise that this may be expensive. Maybe a repo configuration option "cached" to enable or disable this.) As Jakub suggested, it would be necessary to configure the location of the source for any missing objects, but that is probably in the repo config anyway. > A shallow clone with explicit cauterization in the grafts file at > least would not have that problem. 
Although the user will still > not see the exact same result as what would happen in a full > repository, at least we can say "your git log ends at that > commit because your copy of the history does not go back beyond > that" and the user would understand. Or, we could say, perform the operation while you are online, and can access the necessary objects. If the user has explicitly chosen to make a lazy clone, then they should expect that at some point, whatever they do may require them to be online to access items that they have not yet cloned. Rogan ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: Figured out how to get Mozilla into git 2006-06-10 14:47 ` Rogan Dawes @ 2006-06-10 14:58 ` Jakub Narebski 2006-06-10 15:14 ` Nicolas Pitre 1 sibling, 0 replies; 67+ messages in thread From: Jakub Narebski @ 2006-06-10 14:58 UTC (permalink / raw) To: git Rogan Dawes wrote: > Junio C Hamano wrote: >> Rogan Dawes <lists@dawes.za.net> writes: >> >>> Here's an idea. How about separating trees and commits from the actual >>> blobs (e.g. in separate packs)? >> >> If I remember my numbers correctly, trees for any project with a >> size that matters contribute a nonnegligible amount of the total >> pack weight. Perhaps 10-25%. > > Out of curiosity, do you think that it may be possible for tree objects > to compress more/better if they are packed together? Or does the > existing pack compression logic already do the diff against similar tree > objects? The problem with compressing and deltafying trees is with the sha1 object identifiers, I guess. -- Jakub Narebski Warsaw, Poland ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: Figured out how to get Mozilla into git 2006-06-10 14:47 ` Rogan Dawes 2006-06-10 14:58 ` Jakub Narebski @ 2006-06-10 15:14 ` Nicolas Pitre 1 sibling, 0 replies; 67+ messages in thread From: Nicolas Pitre @ 2006-06-10 15:14 UTC (permalink / raw) To: Rogan Dawes; +Cc: Junio C Hamano, git On Sat, 10 Jun 2006, Rogan Dawes wrote: > Out of curiosity, do you think that it may be possible for tree objects to > compress more/better if they are packed together? Or does the existing pack > compression logic already do the diff against similar tree objects? Tree objects for the same directories are already packed and deltified against each other in a pack. Nicolas ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: Figured out how to get Mozilla into git 2006-06-10 8:36 ` Rogan Dawes 2006-06-10 9:08 ` Junio C Hamano @ 2006-06-10 17:53 ` Linus Torvalds 2006-06-10 18:02 ` Jon Smirl 2006-06-10 18:36 ` Rogan Dawes 1 sibling, 2 replies; 67+ messages in thread From: Linus Torvalds @ 2006-06-10 17:53 UTC (permalink / raw) To: Rogan Dawes; +Cc: Jon Smirl, Martin Langhoff, git On Sat, 10 Jun 2006, Rogan Dawes wrote: > > Here's an idea. How about separating trees and commits from the actual blobs > (e.g. in separate packs)? My reasoning is that the commits and trees should > only be a small portion of the overall repository size, and should not be that > expensive to transfer. (Of course, this is only a guess, and needs some > numbers to back it up.) The trees in particular are actually a pretty big part of the history. More importantly, the blobs compress horribly badly in the absence of history - a _lot_ of the compression in git packing comes very much from the fact that we do a good job at delta-compression. So if you get all of the commit/tree history, but none of the blob history, you're actually not going to win that much space. As already discussed, the _whole_ history packed with git is usually not insanely bigger than just the whole unpacked tree (with no history at all). So you'd think that getting just the top version of the tree would be a much bigger space-saving than it actually is. If you _also_ get all the tree and commit objects, the space saving is even less. I actually suspect that the most realistic way to handle this is to use the "fetch.c" logic (ie the incremental fetcher used by http), and add some mode to the git daemon where you fetch literally one object at a time (ie this would be totally _separate_ from the pack-file thing: you'd not ask for "git-upload-pack", you'd ask for something like "git-serve-objects" instead). The fetch.c logic really does allow for on-demand object fetching, and is thus much more suitable for incomplete repositories. HOWEVER. The fetch.c logic - by necessity - works on an object-by-object level. That means that you'd get no delta compression AT ALL, and I suspect that the downside of that would be a factor of ten expansion or more, which means that it would really not work that well in practice. It might be worth testing, though. It would work fine for the "after I have the initial cauterized tree, fetch small incremental updates" case. The operative words here being "small" and "incremental", because I'm pretty sure it really would suck for the case of a big fetch. But it would be _simple_, which is why it's worth trying out. It also has the advantage that it would solve the "I had data corruption on my disk, and lost 100 objects, but all the rest is fine" issue. Again, that's not something that the efficient packing protocol handles, exactly because it assumes full history, and uses that to do all its optimizations. Linus ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: Figured out how to get Mozilla into git 2006-06-10 17:53 ` Linus Torvalds @ 2006-06-10 18:02 ` Jon Smirl 2006-06-10 18:36 ` Rogan Dawes 1 sibling, 0 replies; 67+ messages in thread From: Jon Smirl @ 2006-06-10 18:02 UTC (permalink / raw) To: Linus Torvalds; +Cc: Rogan Dawes, Martin Langhoff, git Here's a random idea: how about a tool that turns a real pack into one that is segmented and then faults in segments if you do an operation that needs the old segments? The full pack would always look like it is there even if it isn't. Something like gitk would be modified not to fault in the missing segments. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: Figured out how to get Mozilla into git 2006-06-10 17:53 ` Linus Torvalds 2006-06-10 18:02 ` Jon Smirl @ 2006-06-10 18:36 ` Rogan Dawes 1 sibling, 0 replies; 67+ messages in thread From: Rogan Dawes @ 2006-06-10 18:36 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jon Smirl, Martin Langhoff, git Linus Torvalds wrote: > > On Sat, 10 Jun 2006, Rogan Dawes wrote: >> Here's an idea. How about separating trees and commits from the actual blobs >> (e.g. in separate packs)? My reasoning is that the commits and trees should >> only be a small portion of the overall repository size, and should not be that >> expensive to transfer. (Of course, this is only a guess, and needs some >> numbers to back it up.) > > The trees in particular are actually a pretty big part of the history. > > More importantly, the blobs compress horribly badly in the absence of > history - a _lot_ of the compression in git packing comes very much from > the fact that we do a good job at delta-compression. > > So if you get all of the commit/tree history, but none of the blob > history, you're actually not going to win that much space. As already > discussed, the _whole_ history packed with git is usually not insanely > bigger than just the whole unpacked tree (with no history at all). > > So you'd think that getting just the top version of the tree would be a > much bigger space-saving than it actually is. If you _also_ get all the > tree and commit objects, the space saving is even less. > One possibility, given that the full commit and tree history is so large, is simply to get the HEAD commit and the trees that the commit depends directly on, rather than fetching them all up front. > I actually suspect that the most realistic way to handle this is to use > the "fetch.c" logic (ie the incremental fetcher used by http), and add > some mode to the git daemon where you fetch literally one object at a time > (ie this would be totally _separate_ from the pack-file thing: you'd not > ask for "git-upload-pack", you'd ask for something like > "git-serve-objects" instead). > > The fetch.c logic really does allow for on-demand object fetching, and is > thus much more suitable for incomplete repositories. > > HOWEVER. The fetch.c logic - by necessity - works on an object-by-object > level. That means that you'd get no delta compression AT ALL, and I > suspect that the downside of that would be a factor of ten expansion or > more, which means that it would really not work that well in practice. Would it be possible to add a mode where fetch.c is given a list of desired objects, and returns a list of pointers to those objects? Then callers that already have such a list could be modified to pass the whole list at once, allowing at least SOME compression, and optimisation of round trips, etc? There would be a tradeoff in memory use, though, I guess. Rogan ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: Figured out how to get Mozilla into git 2006-06-10 2:21 ` Jon Smirl 2006-06-10 2:34 ` Carl Worth @ 2006-06-10 3:01 ` Linus Torvalds 1 sibling, 0 replies; 67+ messages in thread From: Linus Torvalds @ 2006-06-10 3:01 UTC (permalink / raw) To: Jon Smirl; +Cc: Martin Langhoff, git On Fri, 9 Jun 2006, Jon Smirl wrote: > > No more cvs diff taking four minutes to finish. I have to do that > every time I want to generate a 10 line patch. Diffs can run locally. > No more cvs update to replace files I deleted because I messed up > edits in them. And I can have local branches, yeah! More importantly, when the CVS server is down (can you say "sourceforge"?), who cares? > What are we going to do about the BeOS developers on Mozilla? There > are a couple more obscure OSes. Well, the git cvsserver exporter apparently works well enough... Linus ^ permalink raw reply [flat|nested] 67+ messages in thread
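The cvsserver route Linus mentions would let the BeOS folks keep using their CVS client against a git repository; a usage sketch along the lines of the git-cvsserver setup of the time, with user, host, and path as placeholders:

    # On the client: point cvs at git-cvsserver over ssh, then check out
    # a branch head as if it were a CVS module.
    export CVS_SERVER=git-cvsserver
    cvs -d :ext:user@host:/path/to/mozilla.git co master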
* Re: Figured out how to get Mozilla into git 2006-06-10 1:59 ` Linus Torvalds 2006-06-10 2:21 ` Jon Smirl @ 2006-06-10 2:30 ` Jon Smirl 2006-06-10 3:41 ` Martin Langhoff 2 siblings, 0 replies; 67+ messages in thread From: Jon Smirl @ 2006-06-10 2:30 UTC (permalink / raw) To: Linus Torvalds; +Cc: Martin Langhoff, git On 6/9/06, Linus Torvalds <torvalds@osdl.org> wrote: > > > On Fri, 9 Jun 2006, Jon Smirl wrote: > > > > > > Btw, does anybody know roughly how much data an initial "cvs co" takes on > > > the mozilla repo? Git will obviously get the whole history, and that will > > > inevitably be bigger than getting a single check-out, but it's not > > > necessarily orders of magnitude bigger. > > > > 339MB for initial checkout I ran the checkout through bzip and it is 36.4MB, 46.4MB with zip. So the ratio may be 15 to 1 for the compressed cvs co vs git. > And I think people run :pserver: with compression by default, so we're > likely talking about half that in actual download overhead, no? > > So a git clone would be about (wild handwaving, don't look at all the > assumptions) four times as expensive - assuming we only look at a poor DSL > line as the expense - as an initial CVS co, but you'd get the _whole_ > history. Which may or may not make up for it. For some people it will, for > others it won't. > > Of course, to make up for some of the initial costs, I suspect that for some > people who are used to "cvs update" taking 15 minutes to update two files, > it would be a serious relief to see the git kind of "300 objects in five > seconds" kinds of pulls. > > Although I guess that's one of the CVS things that SVN improved on. At > least I'd hope so ;/ > > Linus > -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 67+ messages in thread
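For what it's worth, the kind of measurement Jon describes can be reproduced along these lines (a sketch; the directory name is a placeholder, and gzip stands in for "zip"):

    # Approximate the compressed transfer size of a bare checkout.
    tar cf - mozilla | bzip2 -9 | wc -c
    tar cf - mozilla | gzip -9 | wc -c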
* Re: Figured out how to get Mozilla into git 2006-06-10 1:59 ` Linus Torvalds 2006-06-10 2:21 ` Jon Smirl 2006-06-10 2:30 ` Jon Smirl @ 2006-06-10 3:41 ` Martin Langhoff 2006-06-10 3:55 ` Junio C Hamano 2006-06-10 4:02 ` Linus Torvalds 2 siblings, 2 replies; 67+ messages in thread From: Martin Langhoff @ 2006-06-10 3:41 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jon Smirl, git On 6/10/06, Linus Torvalds <torvalds@osdl.org> wrote: > On Fri, 9 Jun 2006, Jon Smirl wrote: > > > > > > Btw, does anybody know roughly how much data an initial "cvs co" takes on > > > the mozilla repo? Git will obviously get the whole history, and that will > > > inevitably be bigger than getting a single check-out, but it's not > > > necessarily orders of magnitude bigger. > > > > 339MB for initial checkout > > And I think people run :pserver: with compression by default, so we're > likely talking about half that in actual download overhead, no? Yes, most people have -z3, and I agree with you, on paper it sounds like the cost is 1/4 of a git clone. However. The CVS protocol is very chatty because the client _acts_ extremely stupid. It says, ok, I got here an empty directory, and the server walks the client through every little step. And all that chatter is uncompressed cleartext under pserver. So the per-file and per-directory overhead are significant. I can do a cvs checkout via pserver:localhost but I don't know off-the-cuff how to measure the traffic. Hints? cheers, martin ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: Figured out how to get Mozilla into git 2006-06-10 3:41 ` Martin Langhoff @ 2006-06-10 3:55 ` Junio C Hamano 2006-06-10 4:02 ` Linus Torvalds 1 sibling, 0 replies; 67+ messages in thread From: Junio C Hamano @ 2006-06-10 3:55 UTC (permalink / raw) To: Martin Langhoff; +Cc: git "Martin Langhoff" <martin.langhoff@gmail.com> writes: > Yes, most people have -z3, and I agree with you, on paper it sounds > like the cost is 1/4 of a git clone. > > However. > > The CVS protocol is very chatty because the client _acts_ extremely > stupid. It says, ok, I got here an empty directory, and the server > walks the client through every little step. And all that chatter is > uncompressed cleartext under pserver. > > So the per-file and per-directory overhead are significant. I can do a > cvs checkout via pserver:localhost but I don't know off-the-cuff how > to measure the traffic. Hints? If you have an otherwise unused interface, you can look at ifconfig output and see RX/TX bytes? But that sounds very crude. Running it through a proxy perhaps? ^ permalink raw reply [flat|nested] 67+ messages in thread
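Junio's interface-counter idea could look like this in practice (a sketch; the server spec is a placeholder, and it assumes the checkout is the only traffic on the interface while it runs):

    ifconfig eth0 | grep 'RX bytes'     # note the counters
    cvs -z3 -d :pserver:... co mozilla  # run the checkout under test
    ifconfig eth0 | grep 'RX bytes'     # subtract to get bytes transferred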
* Re: Figured out how to get Mozilla into git 2006-06-10 3:41 ` Martin Langhoff 2006-06-10 3:55 ` Junio C Hamano @ 2006-06-10 4:02 ` Linus Torvalds 2006-06-10 4:11 ` Linus Torvalds 1 sibling, 1 reply; 67+ messages in thread From: Linus Torvalds @ 2006-06-10 4:02 UTC (permalink / raw) To: Martin Langhoff; +Cc: Jon Smirl, git On Sat, 10 Jun 2006, Martin Langhoff wrote: > > So the per-file and per-directory overhead are significant. I can do a > cvs checkout via pserver:localhost but I don't know off-the-cuff how > to measure the traffic. Hints? Over localhost, you won't see the biggest issue, which is just latency. The git protocol should be absolutely _wonderful_ with bad latency, because once the early back-and-forth on what each side has is done, there's no synchronization any more - it's all just streaming, with full-frame TCP. If :pserver: does per-file "hey, what are you up to" kind of synchronization, the big killer would be the latency from one end to the other, regardless of any throughput. You can try to approximate the latency by just looking at the number of packets, and using a large MTU (and on localhost, the MTU will be pretty large - roughly 16kB). Don't count packet size at all, just count how many packets each protocol sends (both ways), ignoring packets that are just empty ACK's. I don't know how to build a tcpdump expression for "TCP packet with an empty payload", but I bet it's possible. [ And I won't guarantee that it's a wonderful approximation for "network cost", but I think it's potentially a reasonably good one. It's totally realistic to equate 32kB of _streaming_ data (two packets flowing in one direction with no synchronization) with just a single byte of data going back-and-forth synchronously ] Linus ^ permalink raw reply [flat|nested] 67+ messages in thread
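The tcpdump expression Linus suspects is possible does exist: the TCP payload length is the IP total length minus the IP and TCP header lengths, so packets that actually carry data can be matched directly. A sketch, with 2401 as the conventional pserver port; run it during the checkout and interrupt it when done:

    # Count only packets with a non-empty TCP payload, i.e. ignore empty ACKs.
    tcpdump -n -i lo 'tcp port 2401 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)' | wc -l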
* Re: Figured out how to get Mozilla into git 2006-06-10 4:02 ` Linus Torvalds @ 2006-06-10 4:11 ` Linus Torvalds 2006-06-10 6:02 ` Jon Smirl 0 siblings, 1 reply; 67+ messages in thread From: Linus Torvalds @ 2006-06-10 4:11 UTC (permalink / raw) To: Martin Langhoff; +Cc: Jon Smirl, git On Fri, 9 Jun 2006, Linus Torvalds wrote: > > You can try to approximate the latency by just looking at the number of > packets, and using a large MTU (and on localhost, the MTU will be pretty > large - roughly 16kB). Don't count packet size at all, just count how many > packets each protocol sends (both ways), ignoring packets that are just > empty ACK's. Btw, the reason you should ignore empty acks is that they happen when you have a nice streaming one-way thing, because the TCP rules say that you should send an ACK every two full packets minimum, even if you have nothing to say. So empty acks really approximate to "streaming data", while packets with payload _could_ obviously mean "nice streaming data going both ways", but almost always end up being synchronization discussion of some sort. Linus ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: Figured out how to get Mozilla into git 2006-06-10 4:11 ` Linus Torvalds @ 2006-06-10 6:02 ` Jon Smirl 2006-06-10 6:15 ` Junio C Hamano 0 siblings, 1 reply; 67+ messages in thread From: Jon Smirl @ 2006-06-10 6:02 UTC (permalink / raw) To: Linus Torvalds; +Cc: Martin Langhoff, git Here's a new transport problem. When using git-clone to fetch Martin's tree it kept failing for me at dreamhost. I had a parallel fetch running on my local machine which has a much slower net connection. It finally finished and I am watching the end phase where it prints all of the 'walk' messages. The git-http-fetch process has jumped up to 800MB in size after being 2MB during the download. dreamhost has a 500MB process size limit so that is why my fetches kept failing there. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: Figured out how to get Mozilla into git 2006-06-10 6:02 ` Jon Smirl @ 2006-06-10 6:15 ` Junio C Hamano 2006-06-10 15:44 ` Jon Smirl 0 siblings, 1 reply; 67+ messages in thread From: Junio C Hamano @ 2006-06-10 6:15 UTC (permalink / raw) To: Jon Smirl; +Cc: git "Jon Smirl" <jonsmirl@gmail.com> writes: > Here's a new transport problem. When using git-clone to fetch Martin's > tree it kept failing for me at dreamhost. I had a parallel fetch > running on my local machine which has a much slower net connection. It > finally finished and I am watching the end phase where it prints all > of the 'walk' messages. The git-http-fetch process has jumped up to > 800MB in size after being 2MB during the download. dreamhost has a > 500MB process size limit so that is why my fetches kept failing there. The http-fetch process works by mmaping the downloaded pack, and if I recall correctly we are talking about a 600MB pack, so a 500MB limit sounds impossible, perhaps? ^ permalink raw reply [flat|nested] 67+ messages in thread
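Whether an address-space cap like the one Jon hit is in effect can be checked from the shell before starting a long fetch; a sketch:

    ulimit -v    # max virtual memory in KB; ~512000 would match a 500MB cap
    ulimit -a    # the full set of limits in effect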
* Re: Figured out how to get Mozilla into git 2006-06-10 6:15 ` Junio C Hamano @ 2006-06-10 15:44 ` Jon Smirl 2006-06-10 16:15 ` Timo Hirvonen ` (2 more replies) 0 siblings, 3 replies; 67+ messages in thread From: Jon Smirl @ 2006-06-10 15:44 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On 6/10/06, Junio C Hamano <junkio@cox.net> wrote: > "Jon Smirl" <jonsmirl@gmail.com> writes: > > > Here's a new transport problem. When using git-clone to fetch Martin's > > tree it kept failing for me at dreamhost. I had a parallel fetch > > running on my local machine which has a much slower net connection. It > > finally finished and I am watching the end phase where it prints all > > of the 'walk' messages. The git-http-fetch process has jumped up to > > 800MB in size after being 2MB during the download. dreamhost has a > > 500MB process size limit so that is why my fetches kept failing there. > > The http-fetch process works by mmaping the downloaded pack, and > if I recall correctly we are talking about a 600MB pack, so a 500MB > limit sounds impossible, perhaps? The fetch on my local machine failed too. It left nothing behind, now I have to download the 680MB again. walk 1f19465388a4ef7aff7527a13f16122a809487d4 walk c3ca840256e3767d08c649f8d2761a1a887351ab walk 7a74e42699320c02b814b88beadb1ae65009e745 error: Couldn't get http://mirrors.catalyst.net.nz/pub/mozilla.git//refs/tags/JS%5F1%5F7%5FALPHA%5FBASE for tags/JS_1_7_ALPHA_BASE Couldn't resolve host 'mirrors.catalyst.net.nz' error: Could not interpret tags/JS_1_7_ALPHA_BASE as something to pull [jonsmirl@jonsmirl mozgit]$ cg update There is no GIT repository here (.git not found) [jonsmirl@jonsmirl mozgit]$ ls -a . .. [jonsmirl@jonsmirl mozgit]$ -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: Figured out how to get Mozilla into git 2006-06-10 15:44 ` Jon Smirl @ 2006-06-10 16:15 ` Timo Hirvonen 2006-06-10 18:37 ` Petr Baudis 2006-06-10 18:55 ` Lars Johannsen 2 siblings, 0 replies; 67+ messages in thread From: Timo Hirvonen @ 2006-06-10 16:15 UTC (permalink / raw) To: Jon Smirl; +Cc: junkio, git "Jon Smirl" <jonsmirl@gmail.com> wrote: > On 6/10/06, Junio C Hamano <junkio@cox.net> wrote: > > "Jon Smirl" <jonsmirl@gmail.com> writes: > > > > > Here's a new transport problem. When using git-clone to fetch Martin's > > > tree it kept failing for me at dreamhost. I had a parallel fetch > > > running on my local machine which has a much slower net connection. It > > > finally finished and I am watching the end phase where it prints all > > > of the 'walk' messages. The git-http-fetch process has jumped up to > > > 800MB in size after being 2MB during the download. dreamhost has a > > > 500MB process size limit so that is why my fetches kept failing there. > > > > The http-fetch process works by mmaping the downloaded pack, and > > if I recall correctly we are talking about a 600MB pack, so a 500MB > > limit sounds impossible, perhaps? > > The fetch on my local machine failed too. It left nothing behind, now > I have to download the 680MB again. That's sad. Could git-clone be changed to not remove the .git directory if fetching objects fails (after other files in the .git directory have been fetched)? You could then hopefully continue with git-pull. -- http://onion.dynserv.net/~timo/ ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: Figured out how to get Mozilla into git 2006-06-10 15:44 ` Jon Smirl 2006-06-10 16:15 ` Timo Hirvonen @ 2006-06-10 18:37 ` Petr Baudis 2006-06-10 18:55 ` Lars Johannsen 2 siblings, 0 replies; 67+ messages in thread From: Petr Baudis @ 2006-06-10 18:37 UTC (permalink / raw) To: Jon Smirl; +Cc: Junio C Hamano, git Dear diary, on Sat, Jun 10, 2006 at 05:44:58PM CEST, I got a letter where Jon Smirl <jonsmirl@gmail.com> said that... > The fetch on my local machine failed too. It left nothing behind, now > I have to download the 680MB again. > > walk 1f19465388a4ef7aff7527a13f16122a809487d4 > walk c3ca840256e3767d08c649f8d2761a1a887351ab > walk 7a74e42699320c02b814b88beadb1ae65009e745 > error: Couldn't get > http://mirrors.catalyst.net.nz/pub/mozilla.git//refs/tags/JS%5F1%5F7%5FALPHA%5FBASE > for tags/JS_1_7_ALPHA_BASE > Couldn't resolve host 'mirrors.catalyst.net.nz' > error: Could not interpret tags/JS_1_7_ALPHA_BASE as something to pull > [jonsmirl@jonsmirl mozgit]$ cg update > There is no GIT repository here (.git not found) > [jonsmirl@jonsmirl mozgit]$ ls -a > . .. > [jonsmirl@jonsmirl mozgit]$ You could try with cg-clone, which won't delete the repository if things fail. It will clone only the master branch, though. -- Petr "Pasky" Baudis Stuff: http://pasky.or.cz/ A person is just about as big as the things that make them angry. ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: Figured out how to get Mozilla into git 2006-06-10 15:44 ` Jon Smirl 2006-06-10 16:15 ` Timo Hirvonen 2006-06-10 18:37 ` Petr Baudis @ 2006-06-10 18:55 ` Lars Johannsen 2 siblings, 0 replies; 67+ messages in thread From: Lars Johannsen @ 2006-06-10 18:55 UTC (permalink / raw) To: Jon Smirl; +Cc: git On (10/06/06 11:44), Jon Smirl wrote: > Date: Sat, 10 Jun 2006 11:44:58 -0400 > From: "Jon Smirl" <jonsmirl@gmail.com> > To: "Junio C Hamano" <junkio@cox.net> > Subject: Re: Figured out how to get Mozilla into git > Cc: git@vger.kernel.org > > On 6/10/06, Junio C Hamano <junkio@cox.net> wrote: > >"Jon Smirl" <jonsmirl@gmail.com> writes: > > > >> Here's a new transport problem. When using git-clone to fetch Martin's > >> tree it kept failing for me at dreamhost. I had a parallel fetch > >> running on my local machine which has a much slower net connection. It > >> finally finished and I am watching the end phase where it prints all > >> of the 'walk' messages. The git-http-fetch process has jumped up to > >> 800MB in size after being 2MB during the download. dreamhost has a > >> 500MB process size limit so that is why my fetches kept failing there. > > > >The http-fetch process works by mmaping the downloaded pack, and > >if I recall correctly we are talking about a 600MB pack, so a 500MB > >limit sounds impossible, perhaps? > > The fetch on my local machine failed too. It left nothing behind, now > I have to download the 680MB again. > > walk 1f19465388a4ef7aff7527a13f16122a809487d4 > walk c3ca840256e3767d08c649f8d2761a1a887351ab > walk 7a74e42699320c02b814b88beadb1ae65009e745 > error: Couldn't get > http://mirrors.catalyst.net.nz/pub/mozilla.git//refs/tags/JS%5F1%5F7%5FALPHA%5FBASE > for tags/JS_1_7_ALPHA_BASE > Couldn't resolve host 'mirrors.catalyst.net.nz' > error: Could not interpret tags/JS_1_7_ALPHA_BASE as something to pull > [jonsmirl@jonsmirl mozgit]$ cg update > There is no GIT repository here (.git not found) > [jonsmirl@jonsmirl mozgit]$ ls -a > . .. > [jonsmirl@jonsmirl mozgit]$ To avoid repeating the download (on this repo) you could grab it with a browser: - mkdir tmp; cd tmp; git init-db; - copy mirror../pu/mozilla.git/objects/* to .git/objects/ - copy the same repo's info/refs to refsinfo in the tmp dir - run gawk '{if ($2 !~ /\^\{\}$/) print $1 > sprintf(".git/%s",$2);}' refsinfo to extract branches and tags into .git/refs/{heads,tags} - start playing (after a backup) with git-fsck-objects, git-checkout etc. -- Lars Johannsen mail@Lars-johannsen.dk ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: Figured out how to get Mozilla into git 2006-06-10 1:33 ` Linus Torvalds 2006-06-10 1:43 ` Linus Torvalds @ 2006-06-11 22:00 ` Nicolas Pitre 2006-06-18 19:26 ` Linus Torvalds 1 sibling, 1 reply; 67+ messages in thread From: Nicolas Pitre @ 2006-06-11 22:00 UTC (permalink / raw) To: Linus Torvalds; +Cc: Martin Langhoff, Jon Smirl, git On Fri, 9 Jun 2006, Linus Torvalds wrote: > > > On Sat, 10 Jun 2006, Martin Langhoff wrote: > > > > Now I don't know how much memory or time this took, but it clearly > > completed ok. And, it's now a single pack, weighting a grand total of > > 617MB > > Ok, that's more than reasonable. That should be fairly easily mapped on a > 32-bit architecture without any huge problems, even with some VM > fragmentation going on. It might be borderline (and you definitely want a > 3:1 VM user:kernel split), but considering that the original CVS archive > was apparently 3GB, having a single 617M pack-file is still pretty damn > good. That's like 20% of the original, with all the obvious distribution > advantages. I played a bit with git-repack on that repo. The git-pack-objects memory usage grew to around 760MB (git-rev-list was less than that). So an LRU of partial pack mappings might bring that down significantly. Then I used git-repack -a -f --window=20 --depth=20 which produced a nice 468MB pack file along with the invariant 45MB index file for a grand total of 535MB for the whole repo (the .git/refs/ directory alone still occupies 17MB on disk). So it is probably worth having deeper delta chains for large historic repositories, as the deep revisions are unlikely to be referenced that often while the saving is quite significant. Nicolas ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: Figured out how to get Mozilla into git 2006-06-11 22:00 ` Nicolas Pitre @ 2006-06-18 19:26 ` Linus Torvalds 2006-06-18 21:40 ` Martin Langhoff 0 siblings, 1 reply; 67+ messages in thread From: Linus Torvalds @ 2006-06-18 19:26 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Martin Langhoff, Jon Smirl, git On Sun, 11 Jun 2006, Nicolas Pitre wrote: > > Then I used git-repack -a -f --window=20 --depth=20 which produced a > nice 468MB pack file along with the invariant 45MB index file for a > grand total of 535MB for the whole repo (the .git/refs/ directory alone > still occupies 17MB on disk). Btw, can others with that mozilla repo confirm that a mozilla repository that has been repacked seems to be entirely fine, but git-fsck-objects (with "--full", of course) will report error: Packfile .git/objects/pack/pack-06389c21fc3c4312cbc9a4ddde087c907c1a840b.pack SHA1 mismatch with itself for me (the fsck then completes with no other errors what-so-ever, so the contents are actually fine). Or is it just me? Linus ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: Figured out how to get Mozilla into git 2006-06-18 19:26 ` Linus Torvalds @ 2006-06-18 21:40 ` Martin Langhoff 2006-06-18 22:36 ` Linus Torvalds 0 siblings, 1 reply; 67+ messages in thread From: Martin Langhoff @ 2006-06-18 21:40 UTC (permalink / raw) To: Linus Torvalds; +Cc: Nicolas Pitre, Jon Smirl, git On 6/19/06, Linus Torvalds <torvalds@osdl.org> wrote: > Or is it just me? No problems here with my latest import run. fsck-objects --full comes clean, takes 14m: /usr/bin/time git-fsck-objects --full 737.22user 38.79system 14:09.40elapsed 91%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (20807major+19483471minor)pagefaults 0swaps BTW, that import (with the latest code Junio has) took 37hrs even with the aggressive repack -a -d. I want to bench it dropping the -a from the recurring repack, and doing a final repack -a -d. cheers, martin ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: Figured out how to get Mozilla into git 2006-06-18 21:40 ` Martin Langhoff @ 2006-06-18 22:36 ` Linus Torvalds 2006-06-18 22:51 ` Broken PPC sha1.. (Re: Figured out how to get Mozilla into git) Linus Torvalds 0 siblings, 1 reply; 67+ messages in thread From: Linus Torvalds @ 2006-06-18 22:36 UTC (permalink / raw) To: Martin Langhoff; +Cc: Nicolas Pitre, Jon Smirl, git On Mon, 19 Jun 2006, Martin Langhoff wrote: > > No problems here with my latest import run. fsck-objects --full comes > clean, takes 14m: > > /usr/bin/time git-fsck-objects --full > 737.22user 38.79system 14:09.40elapsed 91%CPU (0avgtext+0avgdata 0maxresident)k > 0inputs+0outputs (20807major+19483471minor)pagefaults 0swaps It takes much less than that for me: 408.40user 32.56system 7:22.07elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (145major+13455672minor)pagefaults 0swaps and in particular note the much lower minor pagefaults number (which is a very good approximation of total RSS). Mine is with all the memory optimizations in place, but I didn't see _that_ big of a difference, so there's something else in addition. However, the fact that I get "SHA1 mismatch with itself" is strange. The re-pack will always re-generate the SHA1, so I worry that this is perhaps some PPC-specific bug in SHA1 handling (and it's entirely possible that it's triggered by doing a SHA1 over a 500+MB area). The fact that you don't see it is indicative that it's somehow specific to my setup. > BTW, that import (with the latest code Junio has) took 37hrs even with > the aggressive repack -a -d. I want to bench it dropping the -a from > the recurring repack, and doing a final repack -a -d. Yeah, that's probably the right thing to do. The "-a" is ok with tons of memory, and I'm trying to make it ok with _less_ memory, but it's probably just not worth it. Linus ^ permalink raw reply [flat|nested] 67+ messages in thread
* Broken PPC sha1.. (Re: Figured out how to get Mozilla into git) 2006-06-18 22:36 ` Linus Torvalds @ 2006-06-18 22:51 ` Linus Torvalds 2006-06-18 23:25 ` [PATCH] Fix PPC SHA1 routine for large input buffers Paul Mackerras 0 siblings, 1 reply; 67+ messages in thread From: Linus Torvalds @ 2006-06-18 22:51 UTC (permalink / raw) To: Martin Langhoff, Paul Mackerras; +Cc: Nicolas Pitre, Jon Smirl, git On Sun, 18 Jun 2006, Linus Torvalds wrote: > > On Mon, 19 Jun 2006, Martin Langhoff wrote: > > > > No problems here with my latest import run. fsck-objects --full comes > > clean, takes 14m: > > > > /usr/bin/time git-fsck-objects --full > > 737.22user 38.79system 14:09.40elapsed 91%CPU (0avgtext+0avgdata 0maxresident)k > > 0inputs+0outputs (20807major+19483471minor)pagefaults 0swaps > > It takes much less than that for me: > > 408.40user 32.56system 7:22.07elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k > 0inputs+0outputs (145major+13455672minor)pagefaults 0swaps Ok, re-building the thing with MOZILLA_SHA1=1 rather than my default PPC_SHA1=1 fixes the problem. I no longer get that "SHA1 mismatch with itself" on the pack-file. Sadly, it also takes a _lot_ longer to fsck. Paul - I think the ppc SHA1_Update() overflows in 32 bits, when the length of the memory area to be checksummed is huge. In particular, the pack-file is 535MB in size, and the way we check the SHA1 checksum is by just mapping it all, doing a single SHA1_Update() over the whole pack-file, and comparing the end result with the internal SHA1 at the end of the pack-file. The PPC SHA1_Update() function starts off with: int SHA1_Update(SHA_CTX *c, const void *ptr, unsigned long n) { ... c->len += n << 3; which will obviously overflow if "n" is bigger than 29 bits, ie 512MB. So doing the length in bits (or whatever that "<<3" is there for) doesn't seem to be such a great idea. I guess we could make the caller just always chunk it up, but wouldn't it be nice to fix the PPC SHA1 implementation instead? That said, the _only_ thing this will ever trigger on in practice is exactly this one case: a large packfile whose checksum was _correctly_ generated - because pack-file generation does it in IO chunks using the csum-file interfaces - but that will be incorrectly checked because we check it all at once. So as bugs go, it's a fairly benign one. Linus ^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH] Fix PPC SHA1 routine for large input buffers 2006-06-18 22:51 ` Broken PPC sha1.. (Re: Figured out how to get Mozilla into git) Linus Torvalds @ 2006-06-18 23:25 ` Paul Mackerras 2006-06-19 5:02 ` Linus Torvalds 0 siblings, 1 reply; 67+ messages in thread From: Paul Mackerras @ 2006-06-18 23:25 UTC (permalink / raw) To: Linus Torvalds; +Cc: Martin Langhoff, Nicolas Pitre, Jon Smirl, git The PPC SHA1 routine had an overflow which meant that it gave incorrect results for input buffers >= 512MB. This fixes it by ensuring that the update of the total length in bits is done using 64-bit arithmetic. Signed-off-by: Paul Mackerras <paulus@samba.org> --- Linus Torvalds writes: > Paul - I think the ppc SHA1_Update() overflows in 32 bits, when the length > of the memory area to be checksummed is huge. Yep. I checked the assembly output of this, and it looks right, but I haven't actually tested it by running it... Paul. diff --git a/ppc/sha1.c b/ppc/sha1.c index 5ba4fc5..0820398 100644 --- a/ppc/sha1.c +++ b/ppc/sha1.c @@ -30,7 +30,7 @@ int SHA1_Update(SHA_CTX *c, const void * unsigned long nb; const unsigned char *p = ptr; - c->len += n << 3; + c->len += (uint64_t) n << 3; while (n != 0) { if (c->cnt || n < 64) { nb = 64 - c->cnt; ^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [PATCH] Fix PPC SHA1 routine for large input buffers 2006-06-18 23:25 ` [PATCH] Fix PPC SHA1 routine for large input buffers Paul Mackerras @ 2006-06-19 5:02 ` Linus Torvalds 0 siblings, 0 replies; 67+ messages in thread From: Linus Torvalds @ 2006-06-19 5:02 UTC (permalink / raw) To: Paul Mackerras; +Cc: Martin Langhoff, Nicolas Pitre, Jon Smirl, git On Mon, 19 Jun 2006, Paul Mackerras wrote: > > The PPC SHA1 routine had an overflow which meant that it gave > incorrect results for input buffers >= 512MB. This fixes it by > ensuring that the update of the total length in bits is done using > 64-bit arithmetic. > > Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: Linus Torvalds <torvalds@osdl.org> This fixes git-fsck-objects for me on the mozilla archive, no more complaints about bad SHA1's. And yeah, now it's taking me 14 minutes too, so the 7-minute fsck was just because it didn't actually check the SHA1 of the large pack fully. (Which is actually good news - half of the time is literally checking the pack integrity. That implies that the individual object integrity isn't as dominating as I thought it would be, and that things like hw-accelerated SHA1 engines will help with fsck. I'd not be surprised to see things like that in a couple of years). Linus ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: Figured out how to get Mozilla into git
  2006-06-09 2:17 Figured out how to get Mozilla into git Jon Smirl
  2006-06-09 2:56 ` Nicolas Pitre
  2006-06-09 3:06 ` Martin Langhoff
@ 2006-06-09 3:12 ` Pavel Roskin
  2 siblings, 0 replies; 67+ messages in thread
From: Pavel Roskin @ 2006-06-09 3:12 UTC (permalink / raw)
  To: Jon Smirl; +Cc: git

Hi Jon,

Quoting Jon Smirl <jonsmirl@gmail.com>:

> I was able to import Mozilla into SVN without problem, it just occured
> to me to then import the SVN repository in git.

I feel bad that I didn't suggest it before. That it works is quite
expected: Subversion was created by CVS developers with the intention
of replacing CVS, and cvs2svn was written by those same developers,
who paid attention to all the CVS quirks. cvs2svn is quite mature, and
it has a testsuite, if I remember correctly.

My concern is how well a Subversion repository can be mapped to git,
considering that Subversion is branch-agnostic. But if it works for
Mozilla, this approach could be recommended for anything big and
serious.

> The import has been running a few hours now and it is up to the year
> 2000 (starts in 1998). Since I haven't hit any errors yet it will
> probably finish ok. I should have the results in the morning. I wonder
> how long it will take to start gitk on a 10GB repository.

That's the "raison d'être" of qgit. I don't know of anything gitk has
that qgit doesn't, except bisecting.

> Once I get this monster into git, are there tools that will let me
> keep it in sync with Mozilla CVS?

Ideally, make the Mozilla developers use git :-)

> SVN renamed numeric branches to this form, unlabeled-3.7.24, so that
> may be a problem.

I think git-svn is supposed to do the svn->git part, but I'm afraid it
will need some work to do it effectively. A Google search for "cvs2svn
incremental" brings up some patches. cvsup can be used to synchronize
the CVS repository itself.

-- 
Regards,
Pavel Roskin

^ permalink raw reply	[flat|nested] 67+ messages in thread
end of thread, other threads:[~2006-06-19 5:03 UTC | newest]

Thread overview: 67+ messages

2006-06-09 2:17 Figured out how to get Mozilla into git Jon Smirl
2006-06-09 2:56 ` Nicolas Pitre
2006-06-09 3:06 ` Martin Langhoff
2006-06-09 3:28 ` Jon Smirl
2006-06-09 7:17 ` Jakub Narebski
2006-06-09 15:01 ` Linus Torvalds
2006-06-09 16:11 ` Nicolas Pitre
2006-06-09 16:30 ` Linus Torvalds
2006-06-09 17:38 ` Nicolas Pitre
2006-06-09 17:49 ` Linus Torvalds
2006-06-09 17:10 ` Jakub Narebski
2006-06-09 18:13 ` Jon Smirl
2006-06-09 19:00 ` Linus Torvalds
2006-06-09 20:17 ` Jon Smirl
2006-06-09 20:40 ` Linus Torvalds
2006-06-09 20:56 ` Jon Smirl
2006-06-09 21:57 ` Linus Torvalds
2006-06-09 22:17 ` Linus Torvalds
2006-06-09 23:16 ` Greg KH
2006-06-09 23:37 ` Martin Langhoff
2006-06-09 23:43 ` Linus Torvalds
2006-06-10 0:00 ` Jon Smirl
2006-06-10 0:11 ` Linus Torvalds
2006-06-10 0:16 ` Jon Smirl
2006-06-10 0:45 ` Jon Smirl
2006-06-09 20:44 ` Jakub Narebski
2006-06-09 21:05 ` Nicolas Pitre
2006-06-09 21:46 ` Jon Smirl
2006-06-10 1:23 ` Martin Langhoff
2006-06-10 1:14 ` Martin Langhoff
2006-06-10 1:33 ` Linus Torvalds
2006-06-10 1:43 ` Linus Torvalds
2006-06-10 1:48 ` Jon Smirl
2006-06-10 1:59 ` Linus Torvalds
2006-06-10 2:21 ` Jon Smirl
2006-06-10 2:34 ` Carl Worth
2006-06-10 3:08 ` Linus Torvalds
2006-06-10 8:21 ` Jakub Narebski
2006-06-10 9:00 ` Junio C Hamano
2006-06-10 8:36 ` Rogan Dawes
2006-06-10 9:08 ` Junio C Hamano
2006-06-10 14:47 ` Rogan Dawes
2006-06-10 14:58 ` Jakub Narebski
2006-06-10 15:14 ` Nicolas Pitre
2006-06-10 17:53 ` Linus Torvalds
2006-06-10 18:02 ` Jon Smirl
2006-06-10 18:36 ` Rogan Dawes
2006-06-10 3:01 ` Linus Torvalds
2006-06-10 2:30 ` Jon Smirl
2006-06-10 3:41 ` Martin Langhoff
2006-06-10 3:55 ` Junio C Hamano
2006-06-10 4:02 ` Linus Torvalds
2006-06-10 4:11 ` Linus Torvalds
2006-06-10 6:02 ` Jon Smirl
2006-06-10 6:15 ` Junio C Hamano
2006-06-10 15:44 ` Jon Smirl
2006-06-10 16:15 ` Timo Hirvonen
2006-06-10 18:37 ` Petr Baudis
2006-06-10 18:55 ` Lars Johannsen
2006-06-11 22:00 ` Nicolas Pitre
2006-06-18 19:26 ` Linus Torvalds
2006-06-18 21:40 ` Martin Langhoff
2006-06-18 22:36 ` Linus Torvalds
2006-06-18 22:51 ` Broken PPC sha1.. (Re: Figured out how to get Mozilla into git) Linus Torvalds
2006-06-18 23:25 ` [PATCH] Fix PPC SHA1 routine for large input buffers Paul Mackerras
2006-06-19 5:02 ` Linus Torvalds
2006-06-09 3:12 ` Figured out how to get Mozilla into git Pavel Roskin