* Figured out how to get Mozilla into git
@ 2006-06-09 2:17 Jon Smirl
2006-06-09 2:56 ` Nicolas Pitre
` (2 more replies)
0 siblings, 3 replies; 69+ messages in thread
From: Jon Smirl @ 2006-06-09 2:17 UTC (permalink / raw)
To: git
I was able to import Mozilla into SVN without a problem; it then occurred
to me to import the SVN repository into git. The import has been
running for a few hours now and it is up to the year 2000 (it starts in
1998). Since I haven't hit any errors yet, it will probably finish OK.
I should have the results in the morning. I wonder how long it will
take to start gitk on a 10GB repository.
Once I get this monster into git, are there tools that will let me
keep it in sync with Mozilla CVS?
SVN renamed numeric branches to the form unlabeled-3.7.24, so that
may be a problem.
Any advice on how to pack this to make it run faster?
--
Jon Smirl
jonsmirl@gmail.com
* Re: Figured out how to get Mozilla into git
2006-06-09 2:17 Figured out how to get Mozilla into git Jon Smirl
@ 2006-06-09 2:56 ` Nicolas Pitre
2006-06-09 3:06 ` Martin Langhoff
2006-06-09 3:12 ` Figured out how to get Mozilla into git Pavel Roskin
2 siblings, 0 replies; 69+ messages in thread
From: Nicolas Pitre @ 2006-06-09 2:56 UTC (permalink / raw)
To: Jon Smirl; +Cc: git
On Thu, 8 Jun 2006, Jon Smirl wrote:
> I was able to import Mozilla into SVN without a problem; it then occurred
> to me to import the SVN repository into git. The import has been
> running for a few hours now and it is up to the year 2000 (it starts in
> 1998). Since I haven't hit any errors yet, it will probably finish OK.
> I should have the results in the morning. I wonder how long it will
> take to start gitk on a 10GB repository.
Before you do so, consider repacking the repository with
git-repack -a -f -d && git-prune-packed
Nicolas
* Re: Figured out how to get Mozilla into git
2006-06-09 2:17 Figured out how to get Mozilla into git Jon Smirl
2006-06-09 2:56 ` Nicolas Pitre
@ 2006-06-09 3:06 ` Martin Langhoff
2006-06-09 3:28 ` Jon Smirl
` (2 more replies)
2006-06-09 3:12 ` Figured out how to get Mozilla into git Pavel Roskin
2 siblings, 3 replies; 69+ messages in thread
From: Martin Langhoff @ 2006-06-09 3:06 UTC (permalink / raw)
To: Jon Smirl; +Cc: git
Jon,
oh, I went back to a cvsimport that I started a couple of days ago. It
completed with no problems...
Last commit:
commit 5ecb56b9c4566618fad602a8da656477e4c6447a
Author: wtchang%redhat.com <wtchang%redhat.com>
Date: Fri Jun 2 17:20:37 2006 +0000
Import NSPR 4.6.2 and NSS 3.11.1
mozilla.git$ du -sh .git/
2.0G .git/
It took
43492.19user 53504.77system 40:23:49elapsed 66%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (77334major+3122469478minor)pagefaults 0swaps
> I should have the results in the morning. I wonder how long it will
> take to start gitk on a 10GB repository.
Hopefully not that big :) -- anyway, just do gitk --max-count=1000
> Once I get this monster into git, are there tools that will let me
> keep it in sync with Mozilla CVS?
If you use git-cvsimport, you can safely re-run it on a cron job to
keep it in sync. I'm not too sure about the cvs2svn => git-svnimport
path, though git-svnimport does support incremental imports.
> SVN renamed numeric branches to this form, unlabeled-3.7.24, so that
> may be a problem.
Ouch,
> Any advice on how to pack this to make it run faster?
git-repack -a -d but it OOMs on my 2GB+2GBswap machine :(
martin
* Re: Figured out how to get Mozilla into git
2006-06-09 3:06 ` Martin Langhoff
@ 2006-06-09 3:28 ` Jon Smirl
2006-06-09 7:17 ` Jakub Narebski
2006-06-09 18:13 ` Jon Smirl
2006-06-10 1:14 ` Martin Langhoff
2 siblings, 1 reply; 69+ messages in thread
From: Jon Smirl @ 2006-06-09 3:28 UTC (permalink / raw)
To: Martin Langhoff; +Cc: git
On 6/8/06, Martin Langhoff <martin.langhoff@gmail.com> wrote:
> Jon,
>
> oh, I went back to a cvsimport that I started a couple days ago.
> Completed with no problems...
I am using cvsps-2.1-3.fc5; the last time I tried, it died in the
middle of the import. I don't remember why. Which cvsps are
you using? You're saying that it can handle the whole Mozilla CVS now,
right? I will build a new cvsps from CVS and start it running tonight.
> If you use git-cvsimport, you can safely re-run it on a cronjob to
> keep it in sync. Not too sure about the cvs2svn => git-svnimport,
> though git-svnimport does support incremental imports.
I would much rather get a direct CVS import working so that I can do
incremental updates. I went the SVN route because it was the only
thing I could get working.
> > Any advice on how to pack this to make it run faster?
>
> git-repack -a -d but it OOMs on my 2GB+2GBswap machine :(
We are all having problems getting this to run on 32-bit machines with
their 3-4GB process size limitations.
--
Jon Smirl
jonsmirl@gmail.com
* Re: Figured out how to get Mozilla into git
2006-06-09 3:28 ` Jon Smirl
@ 2006-06-09 7:17 ` Jakub Narebski
2006-06-09 15:01 ` Linus Torvalds
0 siblings, 1 reply; 69+ messages in thread
From: Jakub Narebski @ 2006-06-09 7:17 UTC (permalink / raw)
To: git
Jon Smirl wrote:
>> git-repack -a -d but it OOMs on my 2GB+2GBswap machine :(
>
> We are all having problems getting this to run on 32 bit machines with
> the 3-4GB process size limitations.
Is that expected (for a 10GB repository, if I remember correctly), or is
there some way to avoid this OOM?
--
Jakub Narebski
Warsaw, Poland
* Re: Figured out how to get Mozilla into git
2006-06-09 7:17 ` Jakub Narebski
@ 2006-06-09 15:01 ` Linus Torvalds
2006-06-09 16:11 ` Nicolas Pitre
0 siblings, 1 reply; 69+ messages in thread
From: Linus Torvalds @ 2006-06-09 15:01 UTC (permalink / raw)
To: Jakub Narebski; +Cc: git
On Fri, 9 Jun 2006, Jakub Narebski wrote:
> Jon Smirl wrote:
>
> >> git-repack -a -d but it OOMs on my 2GB+2GBswap machine :(
> >
> > We are all having problems getting this to run on 32 bit machines with
> > the 3-4GB process size limitations.
>
> Is that expected (for 10GB repository if I remember correctly), or is there
> some way to avoid this OOM?
Well, to some degree, the VM limitations are inevitable with huge packs.
The original idea for packs was to avoid making one huge pack, partly
because it was expected to be really really slow to generate (so
incremental repacking was a much better strategy), but partly simply
because trying to map one huge pack is really hard to do.
For various reasons, we ended up mostly using a single pack most of the
time: it's the most efficient model when the project is reasonably sized,
and it turns out that with the delta re-use, repacking even moderately
large projects like the kernel doesn't actually take all that long.
But the fact that we ended up mostly using a single pack for the kernel,
for example, doesn't mean that the fundamental reasons that git supports
multiple packs would somehow have gone away. At some point, the project
gets large enough that one single pack simply isn't reasonable.
So a single 2GB pack is already very much pushing it. It's really really
hard to map in a 2GB file on a 32-bit platform: your VM is usually
fragmented enough that it simply isn't practical. In fact, I think the
limit for _practical_ usage of single packs is probably somewhere in the
half-gig region, unless you just have 64-bit machines.
And yes, I realize that the "single pack" thing actually ends up having
become a fact for cloning, for example. Originally, cloning would unpack
on the receiving end, and leave the repacking to happen there, but that
obviously sucked. So now when we clone, we always get a single pack. That
can absolutely be a problem.
I don't know what the right solution is. Single packs _are_ very useful,
especially after a clone. So it's possible that we should just make the
pack-reading code be able to map partial packs. But the point is that
there are certainly ways we can fix this - it's not _really_ fundamental.
It's going to complicate it a bit (damn, how I hate 32-bit VM
limitations), but the good news is that the whole git model of "everything
is an individual object" means that it's a very _local_ decision: it will
probably be painful to re-do some of the pack reading code and have a LRU
of pack _fragments_ instead of a LRU of packs, but it's only going to
affect a small part of git, and everything else will never even see it.
So large packs are not really a fundamental problem, but right now we have
some practical issues with them.
(It's not _just_ packs: running out of memory is also because of
git-rev-list --objects being pretty memory hungry. I've improved the
memory usage several times by over 50%, but people keep trying larger
projects. It used to be that I considered the kernel a large history;
now we're talking about things that have ten times the number of objects.)
Martin - do you have some place to make that big mozilla repo available?
It would be a good test-case..
Linus
* Re: Figured out how to get Mozilla into git
2006-06-09 15:01 ` Linus Torvalds
@ 2006-06-09 16:11 ` Nicolas Pitre
2006-06-09 16:30 ` Linus Torvalds
2006-06-09 17:10 ` Jakub Narebski
0 siblings, 2 replies; 69+ messages in thread
From: Nicolas Pitre @ 2006-06-09 16:11 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Jakub Narebski, git
On Fri, 9 Jun 2006, Linus Torvalds wrote:
>
>
> On Fri, 9 Jun 2006, Jakub Narebski wrote:
> > Jon Smirl wrote:
> >
> > >> git-repack -a -d but it OOMs on my 2GB+2GBswap machine :(
> > >
> > > We are all having problems getting this to run on 32 bit machines with
> > > the 3-4GB process size limitations.
> >
> > Is that expected (for 10GB repository if I remember correctly), or is there
> > some way to avoid this OOM?
What was that 10GB related to, exactly? The original CVS repo, or the
unpacked GIT repo?
> So a single 2GB pack is already very much pushing it. It's really really
> hard to map in a 2GB file on a 32-bit platform: your VM is usually
> fragmented enough that it simply isn't practical. In fact, I think the
> limit for _practical_ usage of single packs is probably somewhere in the
> half-gig region, unless you just have 64-bit machines.
Sure, but have we already reached that size?
The historic Linux repo currently repacks itself into a ~175MB pack for
63428 commits.
The current Linux repo is ~103MB with a much shorter history (27153
commits).
Given the above, we can estimate the size of the kernel repository after
x commits as follows:
slope = (175 - 103) / (63428 - 27153) = approx .001985MB (2KB) per commit
initial size = 175 - .001985 * 63428 = approx 49MB
So the initial kernel commit is about 49MB in size, which is consistent
with the corresponding compressed tarball. Subsequent commits are 2KB
in size on average. Given that, it will take about 233250 commits before
the kernel reaches a half-gigabyte pack file, and given the current
commit rate (approx 23700 commits per year), that means we still have
nearly 9 years to go. And at that point 64-bit machines are likely to
be the norm.
So given those numbers I don't think this is really an issue. The Linux
kernel is a rather huge and pretty active project to base comparisons
against. The Mozilla repository might be difficult to import and
repack, but once repacked it should still be pretty usable, even on a
32-bit machine and even with a single pack.
Otherwise, it should be quite easy to add a batch-size argument to
git-repack so that git-rev-list and git-pack-objects are called multiple
times with sequential commit ranges, creating a repo with multiple packs.
Nicolas
* Re: Figured out how to get Mozilla into git
2006-06-09 16:11 ` Nicolas Pitre
@ 2006-06-09 16:30 ` Linus Torvalds
2006-06-09 17:38 ` Nicolas Pitre
2006-06-09 17:10 ` Jakub Narebski
1 sibling, 1 reply; 69+ messages in thread
From: Linus Torvalds @ 2006-06-09 16:30 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Jakub Narebski, git
On Fri, 9 Jun 2006, Nicolas Pitre wrote:
>
> > So a single 2GB pack is already very much pushing it. It's really really
> > hard to map in a 2GB file on a 32-bit platform: your VM is usually
> > fragmented enough that it simply isn't practical. In fact, I think the
> > limit for _practical_ usage of single packs is probably somewhere in the
> > half-gig region, unless you just have 64-bit machines.
>
> Sure, but have we already reached that size?
Not for the Linux repos.
But apparently the mozilla repo ends up being 2GB in git. From Martin:
>> oh, I went back to a cvsimport that I started a couple days ago.
>> Completed with no problems...
>>
>> Last commit:
>> commit 5ecb56b9c4566618fad602a8da656477e4c6447a
>> Author: wtchang%redhat.com <wtchang%redhat.com>
>> Date: Fri Jun 2 17:20:37 2006 +0000
>>
>> Import NSPR 4.6.2 and NSS 3.11.1
>>
>> mozilla.git$ du -sh .git/
>> 2.0G .git/
Now, that was done with _incremental_ repacking (i.e. his .git directory
won't be just one large pack), but I bet that if you were to clone it
(without using the "-l" flag or rsync/http), you'd end up with serious
trouble because of the single-pack limit.
So we're starting to see archives where single packs are problematic for
a 32-bit architecture.
Linus
* Re: Figured out how to get Mozilla into git
2006-06-09 16:30 ` Linus Torvalds
@ 2006-06-09 17:38 ` Nicolas Pitre
2006-06-09 17:49 ` Linus Torvalds
0 siblings, 1 reply; 69+ messages in thread
From: Nicolas Pitre @ 2006-06-09 17:38 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Jakub Narebski, git
On Fri, 9 Jun 2006, Linus Torvalds wrote:
>
>
> On Fri, 9 Jun 2006, Nicolas Pitre wrote:
> >
> > > So a single 2GB pack is already very much pushing it. It's really really
> > > hard to map in a 2GB file on a 32-bit platform: your VM is usually
> > > fragmented enough that it simply isn't practical. In fact, I think the
> > > limit for _practical_ usage of single packs is probably somewhere in the
> > > half-gig region, unless you just have 64-bit machines.
> >
> > Sure, but have we already reached that size?
>
> Not for the Linux repos.
>
> But apparently the mozilla repo ends up being 2GB in git. From Martin:
>
> >> oh, I went back to a cvsimport that I started a couple days ago.
> >> Completed with no problems...
> >>
> >> Last commit:
> >> commit 5ecb56b9c4566618fad602a8da656477e4c6447a
> >> Author: wtchang%redhat.com <wtchang%redhat.com>
> >> Date: Fri Jun 2 17:20:37 2006 +0000
> >>
> >> Import NSPR 4.6.2 and NSS 3.11.1
> >>
> >> mozilla.git$ du -sh .git/
> >> 2.0G .git/
He also says:
| git-repack -a -d but it OOMs on my 2GB+2GBswap machine :(
> now that was done with _incremental_ repacking (ie his .git directory
> won't be just one large pack),
So given the nature of packs, incrementally packing an imported
repository _might_ cause worse problems, since each pack must be
self-contained by definition. That means you may end up with multiple
revisions of the same file distributed amongst as many packs, hence none
of those revisions is ever deltified against the others, and to repack
that you currently have to mmap all those packs at once.
> but I bet that if you were to clone it
> (without using the "-l" flag or rsync/http), you'd end up with serious
> trouble because of the single-pack limit.
Maybe that single pack would instead be under the 512MB limit? I'd be
curious to know.
> So we're starting to see archives where single packs are problematic for
> a 32-bit architecture.
Depending on the operation, the single pack might actually be better,
especially for a full clone where everything gets mapped. Multiple
packs will always take more space, which is fine if you don't need
access to all objects at once since individual packs are small, but the
whole of them (when repacking or cloning) isn't.
Nicolas
* Re: Figured out how to get Mozilla into git
2006-06-09 17:38 ` Nicolas Pitre
@ 2006-06-09 17:49 ` Linus Torvalds
0 siblings, 0 replies; 69+ messages in thread
From: Linus Torvalds @ 2006-06-09 17:49 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Jakub Narebski, git
On Fri, 9 Jun 2006, Nicolas Pitre wrote:
>
> Maybe that single pack would instead be under the 512MB limit? I'd be
> curious to know.
Possible, but not likely, and with "git repack -a -d" running out of
memory, we clearly already have a problem in checking that.
That is most likely git-rev-list, though. Which is why I'd like to just
rsync the repo, and run git-rev-list on it, and see what else I can shave
off ;)
> > So we're starting to see archives where single packs are problematic for
> > a 32-bit architecture.
>
> Depending on the operation, the single pack might actually be better,
Absolutely. Which is why I said we probably need to do a LRU on pack
fragments rather than full packs when we do the pack memory mapping.
Linus
* Re: Figured out how to get Mozilla into git
2006-06-09 16:11 ` Nicolas Pitre
2006-06-09 16:30 ` Linus Torvalds
@ 2006-06-09 17:10 ` Jakub Narebski
1 sibling, 0 replies; 69+ messages in thread
From: Jakub Narebski @ 2006-06-09 17:10 UTC (permalink / raw)
To: git
Nicolas Pitre wrote:
> What was that 10GB related to, exactly? The original CVS repo, or the
> unpacked GIT repo?
Erm, Subversion repository, result of cvs2svn conversion:
Jon Smirl> I wonder how long it will take to start gitk on a 10GB
Jon Smirl> repository.
(in first post in this thread).
> Otherwise that should be quite easy to add a batch size argument to
> git-repack so git-rev-list and git-pack-objects are called multiple
> times with sequential commit ranges to create a repo with multiple
> packs.
Good idea. In addition to the best pack size being limited by the 32-bit
and/or RAM size + swap size limits, there are (rare) limits on maximum
file size on some filesystems, e.g. FAT28^W FAT32.
--
Jakub Narebski
Warsaw, Poland
* Re: Figured out how to get Mozilla into git
2006-06-09 3:06 ` Martin Langhoff
2006-06-09 3:28 ` Jon Smirl
@ 2006-06-09 18:13 ` Jon Smirl
2006-06-09 19:00 ` Linus Torvalds
2006-06-10 1:14 ` Martin Langhoff
2 siblings, 1 reply; 69+ messages in thread
From: Jon Smirl @ 2006-06-09 18:13 UTC (permalink / raw)
To: Martin Langhoff; +Cc: git
On 6/8/06, Martin Langhoff <martin.langhoff@gmail.com> wrote:
> mozilla.git$ du -sh .git/
> 2.0G .git/
That looks too small. My svn git import is 2.7GB and the source CVS is
3.0GB. The svn import wasn't finished when I stopped it.
My cvsps process is still running from last night. The error file is
341MB. How big is it when the conversion is finished? My machine is
swapping to death.
I'm still attracted to the cvs2svn tool. It handled everything right
the first time and it only needs 100MB to run. It is also a lot
faster; cvsps and parsecvs both need gigabytes of RAM to run. I'll
look at cvs2svn some more, but I still need to figure out more about
low-level git and learn Python.
--
Jon Smirl
jonsmirl@gmail.com
* Re: Figured out how to get Mozilla into git
2006-06-09 18:13 ` Jon Smirl
@ 2006-06-09 19:00 ` Linus Torvalds
2006-06-09 20:17 ` Jon Smirl
0 siblings, 1 reply; 69+ messages in thread
From: Linus Torvalds @ 2006-06-09 19:00 UTC (permalink / raw)
To: Jon Smirl; +Cc: Martin Langhoff, git
On Fri, 9 Jun 2006, Jon Smirl wrote:
>
> That looks too small. My svn git import is 2.7GB and the source CVS is
> 3.0GB. The svn import wasn't finished when I stopped it.
Git is much better at packing than either CVS or SVN. Get used to it ;)
> My cvsps process is still running from last night. The error file is
> 341MB. How big is it when the conversion is finished? My machine is
> swapping to death.
Do you have all the cvsps patches? There are a few important ones floating
around, and David Mansfield never did a 2.2 release...
I'm pretty sure Martin doesn't run plain 2.1.
Linus
* Re: Figured out how to get Mozilla into git
2006-06-09 19:00 ` Linus Torvalds
@ 2006-06-09 20:17 ` Jon Smirl
2006-06-09 20:40 ` Linus Torvalds
` (3 more replies)
0 siblings, 4 replies; 69+ messages in thread
From: Jon Smirl @ 2006-06-09 20:17 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Martin Langhoff, git
On 6/9/06, Linus Torvalds <torvalds@osdl.org> wrote:
>
>
> On Fri, 9 Jun 2006, Jon Smirl wrote:
> >
> > That looks too small. My svn git import is 2.7GB and the source CVS is
> > 3.0GB. The svn import wasn't finished when I stopped it.
>
> Git is much better at packing than either CVS or SVN. Get used to it ;)
The git tree that Martin got from cvsps is much smaller than the git
tree I got from going to svn and then to git. I don't know why the trees
are 700MB different; it may be different amounts of packing, or one of
the conversion tools is losing something.
Earlier he said:
>git-repack -a -d but it OOMs on my 2GB+2GBswap machine :(
> > My cvsps process is still running from last night. The error file is
> > 341MB. How big is it when the conversion is finished? My machine is
> > swapping to death.
>
> Do you have all the cvsps patches? There's a few important ones floating
> around, and David Mansfield never did a 2.2 release..
I am running cvsps-2.1-3.fc5, so I may be wasting my time. The error
output is 535MB now.
He sent me some git patches, but none for cvsps.
> I'm pretty sure Martin doesn't run plain 2.1.
I haven't come up with anything that is likely to result in Mozilla
switching over to git. Right now it takes three days to convert the
tree. The tree will have to be run in parallel for a while to convince
everyone to switch. I don't have a solution to keeping it in sync in
near real time (commits would still go to CVS). Most Mozilla
developers are interested but the infrastructure needs some help.
Martin has also brought up the problem with needing a partial clone so
that everyone doesn't have to bring down the entire repository. A
trunk checkout is 340MB and Martin's git tree is 2GB (mine 2.7GB). A
kernel tree is only 680M.
--
Jon Smirl
jonsmirl@gmail.com
* Re: Figured out how to get Mozilla into git
2006-06-09 20:17 ` Jon Smirl
@ 2006-06-09 20:40 ` Linus Torvalds
2006-06-09 20:56 ` Jon Smirl
2006-06-09 20:44 ` Jakub Narebski
` (2 subsequent siblings)
3 siblings, 1 reply; 69+ messages in thread
From: Linus Torvalds @ 2006-06-09 20:40 UTC (permalink / raw)
To: Jon Smirl; +Cc: Martin Langhoff, git
On Fri, 9 Jun 2006, Jon Smirl wrote:
>
> > Git is much better at packing than either CVS or SVN. Get used to it ;)
>
> The git tree that Martin got from cvsps is much smaller than the git
> tree I got from going to svn and then to git. I don't know why the trees
> are 700MB different; it may be different amounts of packing, or one of
> the conversion tools is losing something.
... or one of them is adding something.
For example, it may well be that cvs2svn does a lot more commits or
something like that.
That said, I don't even see where git-svn packs anything at all, and
you're absolutely right that when/how you repack can make a huge
difference to disk usage, much more so than any importer details.
> > Do you have all the cvsps patches? There's a few important ones floating
> > around, and David Mansfield never did a 2.2 release..
>
> I am running cvsps-2.1-3.fc5 so I may be wasting my time. Error out is
> 535MB now.
> He sent me some git patches, but none for cvsps.
I've got a couple, but I was hoping David would do a cvsps-2.2. I have
this dim memory of him saying he had done some other improvements too.
> I haven't come up with anything that is likely to result in Mozilla
> switching over to git. Right now it takes three days to convert the
> tree. The tree will have to be run in parallel for a while to convince
> everyone to switch. I don't have a solution to keeping it in sync in
> near real time (commits would still go to CVS). Most Mozilla
> developers are interested but the infrastructure needs some help.
Sure. That said, I pretty much guarantee that the size issues will be much
much worse for any other distributed SCM.
If Mozilla doesn't need the distributed thing, then SVN is probably the
best choice. It's still a total piece of crap, but hey, if crap (==
centralized) is what people are used to, a few billion flies can't be
wrong ;)
If you got your import done, is there some place I can rsync it from, so
at least I can make sure that everything works fine for a repo that size...
One day the Mozilla people will notice that they really _really_ want the
distribution, and they'll figure out quickly enough that SVK doesn't cut
it, I suspect.
Linus
* Re: Figured out how to get Mozilla into git
2006-06-09 20:40 ` Linus Torvalds
@ 2006-06-09 20:56 ` Jon Smirl
2006-06-09 21:57 ` Linus Torvalds
0 siblings, 1 reply; 69+ messages in thread
From: Jon Smirl @ 2006-06-09 20:56 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Martin Langhoff, git
On 6/9/06, Linus Torvalds <torvalds@osdl.org> wrote:
> > I haven't come up with anything that is likely to result in Mozilla
> > switching over to git. Right now it takes three days to convert the
> > tree. The tree will have to be run in parallel for a while to convince
> > everyone to switch. I don't have a solution to keeping it in sync in
> > near real time (commits would still go to CVS). Most Mozilla
> > developers are interested but the infrastructure needs some help.
>
> Sure. That said, I pretty much guarantee that the size issues will be much
> much worse for any other distributed SCM.
>
> If Mozilla doesn't need the distributed thing, then SVN is probably the
> best choice. It's still a total piece of crap, but hey, if crap (==
> centralized) is what people are used to, a few billion flies can't be
> wrong ;)
They need the distributed thing whether they realize it or not. Some
of the external projects like songbird and nvu are vulnerable to drift
since they are running their own repositories. Once a few
moves/renames happen, they can't easily stay in sync anymore. It has
been over a year since NVU was merged back into the trunk.
That is the same reason I want it, so that I can work on stuff locally
and have a repository. The core staff doesn't have this problem
because they can make all the branches they want in the main
repository.
> If you got your import done, is there some place I can rsync it from, and
> at least I can make sure that everything works fine for a repo that size..
> One day the Mozilla people will notice that they really _really_ want the
> distribution, and they'll figure out quickly enough that SVK doesn't cut
> it, I suspect.
It would be better to rsync Martin's copy; he has a lot more bandwidth.
It will take over a day to copy it off my cable modem. I'm signed up
to get FIOS as soon as they turn it on in my neighborhood; it's
already wired on the poles.
>
> Linus
>
--
Jon Smirl
jonsmirl@gmail.com
* Re: Figured out how to get Mozilla into git
2006-06-09 20:56 ` Jon Smirl
@ 2006-06-09 21:57 ` Linus Torvalds
2006-06-09 22:17 ` Linus Torvalds
` (2 more replies)
0 siblings, 3 replies; 69+ messages in thread
From: Linus Torvalds @ 2006-06-09 21:57 UTC (permalink / raw)
To: Jon Smirl; +Cc: Martin Langhoff, git
On Fri, 9 Jun 2006, Jon Smirl wrote:
>
> They need the distributed thing whether they realize it or not. Some
> of the external projects like songbird and nvu are vulnerable to drift
> since they are running their own repositories. Once a few
> move/renames happen they can't easily stay in sync anymore. It has
> been over a year since NVU was merged back into the trunk.
>
> That is the same reason I want it, so that I can work on stuff locally
> and have a repository. The core staff doesn't have this problem
> because they can make all the branches they want in the main
> repository.
Yes. Anyway, I think we'll get git working well for repositories that
size, and eventually the core developers will notice how much better it
is.
In the meantime, the fact that git-cvsimport can be done incrementally
means that once we have the silly pack-file-mapping details worked out, it
should be perfectly fine to run the 3-day import just once, and then work
on it incrementally afterwards without any real problems.
So people like you who want to work on it off-line using a distributed
system _can_ do so, realistically. Maybe not practically _today_, but I
don't think the git issues are serious enough that we'd be talking about
"months from now", but more of a "in a week or so we migh have something
that works fine for your case".
[ They had this long discussion about languages on #monotone the other
day, and the reason I'll take C over anything else any day is the fact
that a well-written C program is literally only limited by hardware,
never by the language. The poor python/perl guys may write things more
quickly, but when they hit a language wall, they hit it.
I think we've got an excellent data model, and handling even something
huge like the _whole_ history of mozilla doesn't look very daunting at
all. I just want to have a real test-case to motivate me to look at the
problems. ]
> It would be better to rsync Martin's copy; he has a lot more bandwidth.
> It will take over a day to copy it off my cable modem. I'm signed up
> to get FIOS as soon as they turn it on in my neighborhood, it's
> already wired on the poles.
Sure. I actually just have regular 128kbps DSL myself. I guess I should
upgrade to 256 (the downside of having deer munching on the roses in our
back yard is that I don't think I even have the option for anything
faster), but I'm so damn well distributed that the slow 128kbps is
actually more than enough - everything serious I do is local anyway.
So it will take me quite some time to download 2GB+, regardless of how fat
a pipe the other end has ;)
Linus
* Re: Figured out how to get Mozilla into git
2006-06-09 21:57 ` Linus Torvalds
@ 2006-06-09 22:17 ` Linus Torvalds
2006-06-09 23:16 ` Greg KH
2006-06-09 23:37 ` Martin Langhoff
2 siblings, 0 replies; 69+ messages in thread
From: Linus Torvalds @ 2006-06-09 22:17 UTC (permalink / raw)
To: Jon Smirl; +Cc: Martin Langhoff, git
On Fri, 9 Jun 2006, Linus Torvalds wrote:
>
> Sure. I actually just have regular 128kbps DSL myself.
Not bits, bytes. 128_KB_/s, of course. Actually, it's slightly more.
Something like 146KB/s, I guess that comes to 1.5Mbps.
Just in case somebody thought I was living in a cave in the middle ages.
Anyway, no nice 5Mbps cable for me.
Linus
* Re: Figured out how to get Mozilla into git
2006-06-09 21:57 ` Linus Torvalds
2006-06-09 22:17 ` Linus Torvalds
@ 2006-06-09 23:16 ` Greg KH
2006-06-09 23:37 ` Martin Langhoff
2 siblings, 0 replies; 69+ messages in thread
From: Greg KH @ 2006-06-09 23:16 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Jon Smirl, Martin Langhoff, git
On Fri, Jun 09, 2006 at 02:57:58PM -0700, Linus Torvalds wrote:
> > It would be better to rsync Martin's copy; he has a lot more bandwidth.
> > It will take over a day to copy it off my cable modem. I'm signed up
> > to get FIOS as soon as they turn it on in my neighborhood, it's
> > already wired on the poles.
>
> So it will take me quite some time to download 2GB+, regardless of how fat
> a pipe the other end has ;)
Fed-Exing a DVD or two would probably be fastest :)
* Re: Figured out how to get Mozilla into git
2006-06-09 21:57 ` Linus Torvalds
2006-06-09 22:17 ` Linus Torvalds
2006-06-09 23:16 ` Greg KH
@ 2006-06-09 23:37 ` Martin Langhoff
2006-06-09 23:43 ` Linus Torvalds
2 siblings, 1 reply; 69+ messages in thread
From: Martin Langhoff @ 2006-06-09 23:37 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Jon Smirl, git
Apologies, I dropped out of the conversation -- Friday night drinks
(NZ timezone) took over ;-)
Now, back on track...
On 6/10/06, Linus Torvalds <torvalds@osdl.org> wrote:
> In the meantime, the fact that git-cvsimport can be done incrementally
> means that once we have the silly pack-file-mapping details worked out, it
> should be perfectly fine to run the 3-day import just once, and then work
> on it incrementally afterwards without any real problems.
Exactly. The dog at this time is cvsps -- I also remember vague
promises from a list regular of publishing a git repo with cvsps2.1 +
some patches from the list.
In any case, and for the record, my cvsps is 2.1 pristine. It handles
the mozilla repo alright, as long as I give it a lot of RAM. I _think_
it slurped 3GB with the mozilla cvs.
I want to review that cvs2svn importer, probably to steal the test
cases and perhaps some logic to revamp/replace cvsps. The thing is --
we can't just drop/replace cvsimport because it does incrementals, so
continuity and consistency are key. All the CVS imports have to take
some hard decisions when the data is bad -- however it is we fudge it,
we kind of want to fudge it consistently ;-)
> So people like you who want to work on it off-line using a distributed
> system _can_ do so, realistically. Maybe not practically _today_
Other than "don't run repack -a", it's feasible. In fact, that's how I
use git 99% of the time -- to do DSCM stuff on projects that are using
CVS, like Moodle.
> The poor python/perl guys may write things more
> quickly, but when they hit a language wall, they hit it.
Flamebait anyone? ;-) It is a different kind of fun -- let's say that
on top of knowing the performance tricks (or, to be more hip: "design
patterns") for the hardware and OS, you also end up learning the
performance tricks of the interpreter/vm/whatever.
> > It would be better to rsync Martin's copy; he has a lot more bandwidth.
I'm coming down to the office now to pick up my laptop, and I'll rsync
it out to our git machine (also NZ kernel mirror, bandwidth should be
good). That's one of the things I've discovered with these large
trees: for the initial publish action, I just use rsync or scp.
Perhaps I'm doing it wrong, but git-push doesn't optimise the
'initialise repo' case, and it takes ages (and in this case, it'd probably
OOM).
> So it will take me quite some time to download 2GB+, regardless of how fat
> a pipe the other end has ;)
Right-o. Linus, Jon, can you guys then ping me when you have cloned it
safely so I can take it down again?
cheers,
martin
* Re: Figured out how to get Mozilla into git
2006-06-09 23:37 ` Martin Langhoff
@ 2006-06-09 23:43 ` Linus Torvalds
2006-06-10 0:00 ` Jon Smirl
0 siblings, 1 reply; 69+ messages in thread
From: Linus Torvalds @ 2006-06-09 23:43 UTC (permalink / raw)
To: Martin Langhoff; +Cc: Jon Smirl, git
On Sat, 10 Jun 2006, Martin Langhoff wrote:
>
> Exactly. The dog at this time is cvsps -- I also remember vague
> promises from a list regular of publishing a git repo with cvsps2.1 +
> some patches from the list.
Ahh. cvsps doesn't do anything incrementally, does it?
Although it _does_ build up a cache of sorts, I think. Those aren't the
parts I actually ever ended up looking at.
But yeah, a cvsps that blows up to a gig of VM and takes half an hour to
parse things just for an incremental update would be a problem.
> In any case, and for the record, my cvsps is 2.1 pristine. It handles
> the mozilla repo alright, as long as I give it a lot of RAM. I _think_
> it slurped 3GB with the mozilla cvs.
Oh, wow. Every single repo I've seen ends up having tons of complaints
from pristine cvsps, but maybe that's because I only end up looking at the
ones with problems ;)
> I'm coming down to the office now to pick up my laptop, and I'll rsync
> it out to our git machine (also NZ kernel mirror, bandwidth should be
> good). That's one of the things I've discovered with these large
> trees: for the initial publish action, I just use rsync or scp.
> Perhaps I'm doing it wrong, but git-push doesn't optimise the
> 'initialise repo', and it take ages (and it this case, it'd probably
> OOM).
>
> > So it will take me quite some time to download 2GB+, regardless of how fat
> > a pipe the other end has ;)
>
> Right-o. Linus, Jon, can you guys then ping me when you have cloned it
> safely so I can take it down again?
Tell me where/when it is, and I'll start slurping. Will let you know when
I'm done.
Linus
* Re: Figured out how to get Mozilla into git
2006-06-09 23:43 ` Linus Torvalds
@ 2006-06-10 0:00 ` Jon Smirl
2006-06-10 0:11 ` Linus Torvalds
0 siblings, 1 reply; 69+ messages in thread
From: Jon Smirl @ 2006-06-10 0:00 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Martin Langhoff, git
On 6/9/06, Linus Torvalds <torvalds@osdl.org> wrote:
> On Sat, 10 Jun 2006, Martin Langhoff wrote:
> > In any case, and for the record, my cvsps is 2.1 pristine. It handles
> > the mozilla repo alright, as long as I give it a lot of RAM. I _think_
> > it slurped 3GB with the mozilla cvs.
>
> Oh, wow. Every single repo I've seen ends up having tons of complaints
> from pristine cvsps, but maybe that's because I only end up looking at the
> ones with problems ;)
Are we sure cvsps is OK? It is generating 500MB of warnings when I run it.
I have cvsps running at dreamhost currently. I had to modify cvs,
cvsps, git, etc. to not respond to signals, to keep them from killing
everything.
I can clone the 2GB git tree there. Let me know when it is up.
--
Jon Smirl
jonsmirl@gmail.com
* Re: Figured out how to get Mozilla into git
2006-06-10 0:00 ` Jon Smirl
@ 2006-06-10 0:11 ` Linus Torvalds
2006-06-10 0:16 ` Jon Smirl
0 siblings, 1 reply; 69+ messages in thread
From: Linus Torvalds @ 2006-06-10 0:11 UTC (permalink / raw)
To: Jon Smirl; +Cc: Martin Langhoff, git
On Fri, 9 Jun 2006, Jon Smirl wrote:
>
> Are we sure cvsps is ok? It is generating 500MB of warnings when I run it.
Do they go away with these patches?
Linus
---
commit 3d1ebcef6b4f9f6c9064efd64da4dd30d93c3c96
Author: Linus Torvalds <torvalds@g5.osdl.org>
Date: Wed Mar 22 17:20:20 2006 -0800
Fix branch ancestor calculation
Not having any ancestor at all means that any valid ancestor (even of
"depth 0") is fine.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
diff --git a/cvsps.c b/cvsps.c
index c22147e..2695a0f 100644
--- a/cvsps.c
+++ b/cvsps.c
@@ -2599,7 +2599,7 @@ static void determine_branch_ancestor(Pa
* note: rev is the pre-commit revision, not the post-commit
*/
if (!head_ps->ancestor_branch)
- d1 = 0;
+ d1 = -1;
else if (strcmp(ps->branch, rev->branch) == 0)
continue;
else if (strcmp(head_ps->ancestor_branch, "HEAD") == 0)
commit 82fcf7e31bbeae3b01a8656549e9b8fd89d598eb
Author: Linus Torvalds <torvalds@g5.osdl.org>
Date: Wed Mar 22 11:23:37 2006 -0800
Improve handling of file collisions in the same patchset
Take the file revision into account.
diff --git a/cvsps.c b/cvsps.c
index 1e64e3c..c22147e 100644
--- a/cvsps.c
+++ b/cvsps.c
@@ -2384,8 +2384,31 @@ void patch_set_add_member(PatchSet * ps,
for (next = ps->members.next; next != &ps->members; next = next->next)
{
PatchSetMember * m = list_entry(next, PatchSetMember, link);
- if (m->file == psm->file && ps->collision_link.next == NULL)
- list_add(&ps->collision_link, &collisions);
+ if (m->file == psm->file) {
+ int order = compare_rev_strings(psm->post_rev->rev, m->post_rev->rev);
+
+ /*
+ * Same revision too? Add it to the collision list
+ * if it isn't already.
+ */
+ if (!order) {
+ if (ps->collision_link.next == NULL)
+ list_add(&ps->collision_link, &collisions);
+ return;
+ }
+
+ /*
+ * If this is an older revision than the one we already have
+ * in this patchset, just ignore it
+ */
+ if (order < 0)
+ return;
+
+ /*
+ * This is a newer one, remove the old one
+ */
+ list_del(&m->link);
+ }
}
psm->ps = ps;
commit 534120d9a47062eecd7b53fd7ac0b70d97feb4fd
Author: Linus Torvalds <torvalds@g5.osdl.org>
Date: Wed Mar 22 11:20:59 2006 -0800
Increase log-length limit to 64kB
Yeah, it should be dynamic. I'm lazy.
diff --git a/cvsps_types.h b/cvsps_types.h
index b41e2a9..dba145d 100644
--- a/cvsps_types.h
+++ b/cvsps_types.h
@@ -8,7 +8,7 @@ #define CVSPS_TYPES_H
#include <time.h>
-#define LOG_STR_MAX 32768
+#define LOG_STR_MAX 65536
#define AUTH_STR_MAX 64
#define REV_STR_MAX 64
#define MIN(a, b) ((a) < (b) ? (a) : (b))
* Re: Figured out how to get Mozilla into git
2006-06-10 0:11 ` Linus Torvalds
@ 2006-06-10 0:16 ` Jon Smirl
2006-06-10 0:45 ` Jon Smirl
0 siblings, 1 reply; 69+ messages in thread
From: Jon Smirl @ 2006-06-10 0:16 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Martin Langhoff, git
I'll apply them and give it a test.
Most of the warnings look like this:
WARNING: Invalid PatchSet 151492, Tag JSS_4_0_RTM:
security/coreconf/HP-UX.mk:1.8=after,
security/jss/org/mozilla/jss/crypto/KeyPairAlgorithm.java:1.5=before.
Treated as 'before'
WARNING: Invalid PatchSet 151492, Tag JSS_4_0_RTM:
security/coreconf/HP-UX.mk:1.8=after,
security/jss/org/mozilla/jss/crypto/KeyPairGenerator.java:1.5=before.
Treated as 'before'
WARNING: Invalid PatchSet 151492, Tag JSS_4_0_RTM:
security/coreconf/HP-UX.mk:1.8=after,
security/jss/org/mozilla/jss/crypto/KeyPairGeneratorSpi.java:1.3=before.
Treated as 'before'
WARNING: Invalid PatchSet 151492, Tag JSS_4_0_RTM:
security/coreconf/HP-UX.mk:1.8=after,
security/jss/org/mozilla/jss/crypto/KeyWrapAlgorithm.java:1.8=before.
Treated as 'before'
WARNING: Invalid PatchSet 151492, Tag JSS_4_0_RTM:
security/coreconf/HP-UX.mk:1.8=after,
security/jss/org/mozilla/jss/crypto/KeyWrapper.java:1.8=before.
Treated as 'before'
WARNING: Invalid PatchSet 151492, Tag JSS_4_0_RTM:
security/coreconf/HP-UX.mk:1.8=after,
security/jss/org/mozilla/jss/crypto/Makefile:1.2=before. Treated as
'before'
WARNING: Invalid PatchSet 151492, Tag JSS_4_0_RTM:
security/coreconf/HP-UX.mk:1.8=after,
security/jss/org/mozilla/jss/crypto/NoSuchItemOnTokenException.java:1.3=before.
Treated as 'before'
--
Jon Smirl
jonsmirl@gmail.com
* Re: Figured out how to get Mozilla into git
2006-06-10 0:16 ` Jon Smirl
@ 2006-06-10 0:45 ` Jon Smirl
0 siblings, 0 replies; 69+ messages in thread
From: Jon Smirl @ 2006-06-10 0:45 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Martin Langhoff, git
They must be running some kind of process accounting at my host. As
soon as I hit 500MB of RAM I get killed immediately. It is not from a
signal; I'm catching all of those.
I get this on the console:
[1]+ Killed
CVSROOT=~/jonsmirl.dreamhosters.com/mozilla/ cvsps -x --norc -A
mozilla >mozilla.cvsps 2>mozilla.cvspserr
and nothing on stdout or stderr.
kernel string:
2.4.29-grsec+w+fhs6b+gr0501+nfs+a32+++p4+sata+c4+gr2b-v6.189
--
Jon Smirl
jonsmirl@gmail.com
* Re: Figured out how to get Mozilla into git
2006-06-09 20:17 ` Jon Smirl
2006-06-09 20:40 ` Linus Torvalds
@ 2006-06-09 20:44 ` Jakub Narebski
2006-06-09 21:05 ` Nicolas Pitre
2006-06-10 1:23 ` Martin Langhoff
3 siblings, 0 replies; 69+ messages in thread
From: Jakub Narebski @ 2006-06-09 20:44 UTC (permalink / raw)
To: git
Jon Smirl wrote:
> Martin has also brought up the problem with needing a partial clone so
> that everyone doesn't have to bring down the entire repository. A
> trunk checkout is 340MB and Martin's git tree is 2GB (mine 2.7GB). A
> kernel tree is only 680M.
We don't have partial/shallow or lazy clones yet (although there might be
some partial shallow-clone solutions in topic branches and/or patches
floating around on the git mailing list).
But you can do what was done for the Linux kernel: split the repository
into current and historical parts, and join them (join the history) if
needed using grafts. And even if one needs the historical repository, it
needs to be cloned/copied only _once_. With alternates (using the
historical repository as an alternate for the current repository),
someone who has both repositories needs, I think, only a little more
space than with a single repository.
--
Jakub Narebski
Warsaw, Poland
* Re: Figured out how to get Mozilla into git
2006-06-09 20:17 ` Jon Smirl
2006-06-09 20:40 ` Linus Torvalds
2006-06-09 20:44 ` Jakub Narebski
@ 2006-06-09 21:05 ` Nicolas Pitre
2006-06-09 21:46 ` Jon Smirl
2006-06-10 1:23 ` Martin Langhoff
3 siblings, 1 reply; 69+ messages in thread
From: Nicolas Pitre @ 2006-06-09 21:05 UTC (permalink / raw)
To: Jon Smirl; +Cc: Linus Torvalds, Martin Langhoff, git
On Fri, 9 Jun 2006, Jon Smirl wrote:
> I haven't come up with anything that is likely to result in Mozilla
> switching over to git. Right now it takes three days to convert the
> tree. The tree will have to be run in parallel for a while to convince
> everyone to switch. I don't have a solution to keeping it in sync in
> near real time (commits would still go to CVS). Most Mozilla
> developers are interested but the infrastructure needs some help.
This is true. GIT is still evolving and certainly needs work to cope
with environments and datasets that were never tested before. The
Mozilla repo is one of those, and we're certainly interested in making
it work well. GIT might not be right for it just yet, but if you could
let us rsync your converted repo to play with, that might help us work on
proper fixes for that kind of repo.
> Martin has also brought up the problem with needing a partial clone so
> that everyone doesn't have to bring down the entire repository.
If it can be repacked into a single pack, that size might get much
smaller too.
Nicolas
* Re: Figured out how to get Mozilla into git
2006-06-09 21:05 ` Nicolas Pitre
@ 2006-06-09 21:46 ` Jon Smirl
0 siblings, 0 replies; 69+ messages in thread
From: Jon Smirl @ 2006-06-09 21:46 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Linus Torvalds, Martin Langhoff, git
On 6/9/06, Nicolas Pitre <nico@cam.org> wrote:
> On Fri, 9 Jun 2006, Jon Smirl wrote:
>
> > I haven't come up with anything that is likely to result in Mozilla
> > switching over to git. Right now it takes three days to convert the
> > tree. The tree will have to be run in parallel for a while to convince
> > everyone to switch. I don't have a solution to keeping it in sync in
> > near real time (commits would still go to CVS). Most Mozilla
> > developers are interested but the infrastructure needs some help.
>
> This is true. GIT is still evolving and certainly needs work to cope
> with environments and datasets that were never tested before. The
> Mozilla repo is one of those and we're certainly interested into making
> it work well. GIT might not be right for it just yet, but if you could
> let us rsync your converted repo to play with that might help us work on
> proper fixes for that kind of repo.
I'm rebuilding it on my shared hosting account at dreamhost.com. I'll
see if I can get it built before they notice and kill my process. My
account there is on a 4GB quad-Xeon box, so hopefully it can convert
the tree faster. My account has 1TB of download per month, so rsync will
be OK. Not bad for $12 for the first year.
It would take over a day to rsync it off my home machine.
> > Martin has also brought up the problem with needing a partial clone so
> > that everyone doesn't have to bring down the entire repository.
>
> If it can be repacked into a single pack that size might get much
> smaller too.
>
>
> Nicolas
>
--
Jon Smirl
jonsmirl@gmail.com
* Re: Figured out how to get Mozilla into git
2006-06-09 20:17 ` Jon Smirl
` (2 preceding siblings ...)
2006-06-09 21:05 ` Nicolas Pitre
@ 2006-06-10 1:23 ` Martin Langhoff
3 siblings, 0 replies; 69+ messages in thread
From: Martin Langhoff @ 2006-06-10 1:23 UTC (permalink / raw)
To: Jon Smirl; +Cc: Linus Torvalds, git
On 6/10/06, Jon Smirl <jonsmirl@gmail.com> wrote:
> The git tree that Martin got from cvsps is much smaller than the git
> tree I got from going to svn and then to git. I don't know why the trees
> are 700MB different; it may be different amounts of packing, or one of
> the conversion tools is losing something.
Don't read too much into that. Packing/repacking points make a _huge_
difference, and even if one of our trees is a bit corrupt, the
pack sizes should be about the same.
(With the patches I sent you we _are_ choosing to ignore a few
branches that don't seem to make sense in cvsps output. These will
show up in the error output -- what I saw were very old, possibly
corrupt branches there, stuff I wouldn't shed a tear over, but it is
worth reviewing).
> I haven't come up with anything that is likely to result in Mozilla
> switching over to git. Right now it takes three days to convert the
> tree. The tree will have to be run in parallel for a while to convince
> everyone to switch. I don't have a solution to keeping it in sync in
> near real time (commits would still go to CVS). Most Mozilla
> developers are interested but the infrastructure needs some help.
Don't worry about the initial import time. Once you've done it, you
can run the incremental import (which will take a few minutes) even
hourly to keep 'in sync'.
> Martin has also brought up the problem with needing a partial clone so
> that everyone doesn't have to bring down the entire repository. A
> trunk checkout is 340MB and Martin's git tree is 2GB (mine 2.7GB). A
> kernel tree is only 680M.
Now that I have managed to repack the repo, it is indeed back in the
600M range. Actually, I just re-repacked; it took under a minute, and
it shrank down to 607MB.
Yay.
I'm sure that if you git-repack -a -d on a machine with plenty of
memory once or twice, we'll have matching packs.
cheers,
martin
* Re: Figured out how to get Mozilla into git
2006-06-09 3:06 ` Martin Langhoff
2006-06-09 3:28 ` Jon Smirl
2006-06-09 18:13 ` Jon Smirl
@ 2006-06-10 1:14 ` Martin Langhoff
2006-06-10 1:33 ` Linus Torvalds
2 siblings, 1 reply; 69+ messages in thread
From: Martin Langhoff @ 2006-06-10 1:14 UTC (permalink / raw)
To: Jon Smirl; +Cc: git
On 6/9/06, Martin Langhoff <martin.langhoff@gmail.com> wrote:
> mozilla.git$ du -sh .git/
> 2.0G .git/
Ok -- pushed the repository out to our mirror box. Try:
git-clone http://mirrors.catalyst.net.nz/pub/mozilla.git/
Now, good news. No, _very_ good news. As I was rsync'ing this out and
looking at the repo, suddenly something seemed odd. Apparently, after
git-repack -a -d OOM'd on me and I had posted that message, I re-ran
it.
[As it happens, I have been running several imports of gentoo and moz
lately on the box. It is entirely possible that cvsps or a stray
git-cvsimport was sitting on a whole lot of RAM at the time.]
Now, I don't know how much memory or time this took, but it clearly
completed OK. And it's now a single pack, weighing in at a grand total
of 617MB.
So my comments about OOM'ing were apparently wrong. Hey, if the whole
history is actually only 617MB, then initial checkouts are back to
something reasonable, I'd say.
cheers,
martin
* Re: Figured out how to get Mozilla into git
2006-06-10 1:14 ` Martin Langhoff
@ 2006-06-10 1:33 ` Linus Torvalds
2006-06-10 1:43 ` Linus Torvalds
2006-06-11 22:00 ` Nicolas Pitre
0 siblings, 2 replies; 69+ messages in thread
From: Linus Torvalds @ 2006-06-10 1:33 UTC (permalink / raw)
To: Martin Langhoff; +Cc: Jon Smirl, git
On Sat, 10 Jun 2006, Martin Langhoff wrote:
>
> Now I don't know how much memory or time this took, but it clearly
> completed ok. And, it's now a single pack, weighting a grand total of
> 617MB
Ok, that's more than reasonable. That should be fairly easily mapped on a
32-bit architecture without any huge problems, even with some VM
fragmentation going on. It might be borderline (and you definitely want a
3:1 VM user:kernel split), but considering that the original CVS archive
was apparently 3GB, having a single 617M pack-file is still pretty damn
good. That's like 20% of the original, with all the obvious distribution
advantages.
Clearly this whole thing _does_ show that we could improve the process of
importing things from CVS a whole lot, and I assume your 617MB pack
doesn't have the nice name/email translations so it needs to be fixed up,
but it sounds like on the whole the core git design came through with
shining colors, even if we may want to polish things up a bit ;)
I'm downloading the thing right now.
Linus
* Re: Figured out how to get Mozilla into git
2006-06-10 1:33 ` Linus Torvalds
@ 2006-06-10 1:43 ` Linus Torvalds
2006-06-10 1:48 ` Jon Smirl
2006-06-11 22:00 ` Nicolas Pitre
1 sibling, 1 reply; 69+ messages in thread
From: Linus Torvalds @ 2006-06-10 1:43 UTC (permalink / raw)
To: Martin Langhoff; +Cc: Jon Smirl, git
On Fri, 9 Jun 2006, Linus Torvalds wrote:
>
> That's like 20% of the original, with all the obvious distribution
> advantages.
Btw, does anybody know roughly how much data an initial "cvs co" takes on
the mozilla repo? Git will obviously get the whole history, and that will
inevitably be bigger than getting a single check-out, but it's not
necessarily orders of magnitude bigger.
It could be that getting a whole git archive is not _that_ much more
expensive than getting a single version, considering how well history
compresses (e.g. the kernel git archive isn't orders of magnitude bigger
than a single compressed tar-ball of the sources).
At that point, it's probably a pretty usable alternative.
(Although, to be fair, we almost certainly have to improve "git-rev-list
--objects --all" performance on that thing, since that's going to
otherwise make it totally impossible to do initial clones using the native
git protocol, and make git look bad).
Linus
* Re: Figured out how to get Mozilla into git
2006-06-10 1:43 ` Linus Torvalds
@ 2006-06-10 1:48 ` Jon Smirl
2006-06-10 1:59 ` Linus Torvalds
0 siblings, 1 reply; 69+ messages in thread
From: Jon Smirl @ 2006-06-10 1:48 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Martin Langhoff, git
On 6/9/06, Linus Torvalds <torvalds@osdl.org> wrote:
>
>
> On Fri, 9 Jun 2006, Linus Torvalds wrote:
> >
> > That's like 20% of the original, with all the obvious distribution
> > advantages.
>
> Btw, does anybody know roughly how much data an initial "cvs co" takes on
> the mozilla repo? Git will obviously get the whole history, and that will
> inevitably be bigger than getting a single check-out, but it's not
> necessarily orders of magnitude bigger.
339MB for initial checkout
> It could be that getting a whole git archive is not _that_ much more
> expnsive than getting a single version, considering how well history
> compresses (eg the kernel git arhive isn't orders of magnitude bigger than
> a single compressed tar-ball of the sources).
>
> At that point, it's probably a pretty usable alternative.
>
> (Although, to be fair, we almost certainly have to improve "git-rev-list
> --objects --all" performance on that thing, since that's going to
> otherwise make it totally impossible to do initial clones using the native
> git protocol, and make git look bad).
>
> Linus
>
--
Jon Smirl
jonsmirl@gmail.com
* Re: Figured out how to get Mozilla into git
2006-06-10 1:48 ` Jon Smirl
@ 2006-06-10 1:59 ` Linus Torvalds
2006-06-10 2:21 ` Jon Smirl
` (2 more replies)
0 siblings, 3 replies; 69+ messages in thread
From: Linus Torvalds @ 2006-06-10 1:59 UTC (permalink / raw)
To: Jon Smirl; +Cc: Martin Langhoff, git
On Fri, 9 Jun 2006, Jon Smirl wrote:
> >
> > Btw, does anybody know roughly how much data an initial "cvs co" takes on
> > the mozilla repo? Git will obviously get the whole history, and that will
> > inevitably be bigger than getting a single check-out, but it's not
> > necessarily orders of magnitude bigger.
>
> 339MB for initial checkout
And I think people run :pserver: with compression by default, so we're
likely talking about half that in actual download overhead, no?
So a git clone would be about (wild handwaving, don't look at all the
assumptions) four times as expensive - assuming we only look at a poor DSL
line as the expense - as an initial CVS co, but you'd get the _whole_
history. Which may or may not make up for it. For some people it will, for
others it won't.
Of course, to make up for some of the initial costs, I suspect that for
people who are used to "cvs update" taking 15 minutes to update two files,
it would be a serious relief to see the git kind of "300 objects in five
seconds" pulls.
Although I guess that's one of the CVS things that SVN improved on. At
least I'd hope so ;/
Linus
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Figured out how to get Mozilla into git
2006-06-10 1:59 ` Linus Torvalds
@ 2006-06-10 2:21 ` Jon Smirl
2006-06-10 2:34 ` Carl Worth
2006-06-10 3:01 ` Linus Torvalds
2006-06-10 2:30 ` Jon Smirl
2006-06-10 3:41 ` Martin Langhoff
2 siblings, 2 replies; 69+ messages in thread
From: Jon Smirl @ 2006-06-10 2:21 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Martin Langhoff, git
On 6/9/06, Linus Torvalds <torvalds@osdl.org> wrote:
>
>
> On Fri, 9 Jun 2006, Jon Smirl wrote:
> > >
> > > Btw, does anybody know roughly how much data an initial "cvs co" takes on
> > > the mozilla repo? Git will obviously get the whole history, and that will
> > > inevitably be bigger than getting a single check-out, but it's not
> > > necessarily orders of magnitude bigger.
> >
> > 339MB for initial checkout
>
> And I think people run :pserver: with compression by default, so we're
> likely talking about half that in actual download overhead, no?
>
> So a git clone would be about (wild handwaving, don't look at all the
> assumptions) four times as expensive - assuming we only look at a poor DSL
> line as the expense - as an initial CVS co, but you'd get the _whole_
> history. Which may or may not make up for it. For some people it will, for
> others it won't.
Could you clone the repo and delete changesets earlier than 2004? Then
I would clone the small repo and work with it. Later I decide I want
full history, can I pull from a full repository at that point and get
updated? That would need a flag to trigger it since I don't want full
history to come over if I am just getting updates from someone else's
tree that has a full history.
>
> Of course, to make up for some of the initial costs, I suspect that for
> people who are used to "cvs update" taking 15 minutes to update two files,
> it would be a serious relief to see the git kind of "300 objects in five
> seconds" pulls.
No more cvs diff taking four minutes to finish. I have to do that
every time I want to generate a 10 line patch. Diffs can run locally.
No more cvs update to replace files I deleted because I messed up
edits in them. And I can have local branches, yeah!
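All of that is local with git, something like (a sketch; exact options may
vary by git version):

    git-diff                    # local diff, no network round-trip
    git-checkout -- <file>      # restore a messed-up or deleted file
    git-checkout -b my-branch   # a local branch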
What are we going to do about the BeOS developers on Mozilla? There
are a couple more obscure OSes.
> Although I guess that's one of the CVS things that SVN improved on. At
> least I'd hope so ;/
>
> Linus
>
--
Jon Smirl
jonsmirl@gmail.com
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Figured out how to get Mozilla into git
2006-06-10 2:21 ` Jon Smirl
@ 2006-06-10 2:34 ` Carl Worth
2006-06-10 3:08 ` Linus Torvalds
2006-06-10 3:01 ` Linus Torvalds
1 sibling, 1 reply; 69+ messages in thread
From: Carl Worth @ 2006-06-10 2:34 UTC (permalink / raw)
To: Jon Smirl; +Cc: Linus Torvalds, Martin Langhoff, git
On Fri, 9 Jun 2006 22:21:17 -0400, "Jon Smirl" wrote:
>
> Could you clone the repo and delete changesets earlier than 2004? Then
> I would clone the small repo and work with it. Later I decide I want
> full history, can I pull from a full repository at that point and get
> updated? That would need a flag to trigger it since I don't want full
> history to come over if I am just getting updates from someone else's
> tree that has a full history.
This is clearly a desirable feature, and has been requested by several
people (including myself) looking to switch some large-ish histories
from an existing system to git.
If you'd like to look through git archives for some discussion of the
issues that would be involved here, look for "shallow clone".
There's a related proposal termed "lazy clone" for one that would pull
down missing objects as needed over the network.
My impression is that both things will eventually be implemented.
There's certainly nothing fundamental in git that will prevent them
(though there will be some interesting things to resolve as real
patches for this stuff are explored).
-Carl
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Figured out how to get Mozilla into git
2006-06-10 2:34 ` Carl Worth
@ 2006-06-10 3:08 ` Linus Torvalds
2006-06-10 8:21 ` Jakub Narebski
2006-06-10 8:36 ` Rogan Dawes
0 siblings, 2 replies; 69+ messages in thread
From: Linus Torvalds @ 2006-06-10 3:08 UTC (permalink / raw)
To: Carl Worth; +Cc: Jon Smirl, Martin Langhoff, git
On Fri, 9 Jun 2006, Carl Worth wrote:
> On Fri, 9 Jun 2006 22:21:17 -0400, "Jon Smirl" wrote:
> >
> > Could you clone the repo and delete changesets earlier than 2004? Then
> > I would clone the small repo and work with it. Later I decide I want
> > full history, can I pull from a full repository at that point and get
> > updated? That would need a flag to trigger it since I don't want full
> > history to come over if I am just getting updates from someone else's
> > tree that has a full history.
>
> This is clearly a desirable feature, and has been requested by several
> people (including myself) looking to switch some large-ish histories
> from an existing system to git.
The thing is, to some degree it's really fundamentally hard.
It's easy for a linear history. What you do for a linear history is to
just get the top commit, and the tree associated with it, and then you
cauterize the parent by just grafting it to go away. Boom. You're done.
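Concretely, cauterizing is a one-line entry in .git/info/grafts - the
commit id with no parents listed after it (the sha1 here is made up):

    echo 3b18e512dba79e4c8300dd08aeb37f8e728b8dad >> .git/info/grafts

and rev-list & friends will then treat that commit as the start of
history.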
The problem is that if the preceding history _wasn't_ linear (or, in
fact, _subsequent_ development refers to it by having branched off at an
earlier point), and you try to pull your updates, the other end (that
knows about all the history) will assume you have all the history that you
don't have, and will send you a pack assuming that.
Which won't even necessarily have all the tree/blob objects (it assumed
you already had them), but more annoyingly, the history won't be
cauterized, and you'll have dangling commits. Which you can cauterize by
hand, of course, but you literally _will_ have to get the objects and
cauterize the thing by hand.
You're right that it's not "fundamentally impossible" to do: the git
format certainly _allows_ it. But the git protocol handshake really does
end up optimizing away all the unnecessary work by knowing that the other
side will have all the shared history, so lacking the shared history will
mean that you're a bit screwed.
Using the http protocol actually works. It doesn't do any handshake: it
will just fetch objects from the other end as it needs them. The downside,
of course, is that it also doesn't understand packs, so if the source is
packed (and it pretty much _will_ be, for any big source), you're going to
end up getting it all _anyway_.
Linus
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Figured out how to get Mozilla into git
2006-06-10 3:08 ` Linus Torvalds
@ 2006-06-10 8:21 ` Jakub Narebski
2006-06-10 9:00 ` Junio C Hamano
2006-06-10 8:36 ` Rogan Dawes
1 sibling, 1 reply; 69+ messages in thread
From: Jakub Narebski @ 2006-06-10 8:21 UTC (permalink / raw)
To: git
Linus Torvalds wrote:
> On Fri, 9 Jun 2006, Carl Worth wrote:
>
>> On Fri, 9 Jun 2006 22:21:17 -0400, "Jon Smirl" wrote:
>> >
>> > Could you clone the repo and delete changesets earlier than 2004? Then
>> > I would clone the small repo and work with it. Later I decide I want
>> > full history, can I pull from a full repository at that point and get
>> > updated? That would need a flag to trigger it since I don't want full
>> > history to come over if I am just getting updates from someone else's
>> > tree that has a full history.
>>
>> This is clearly a desirable feature, and has been requested by several
>> people (including myself) looking to switch some large-ish histories
>> from an existing system to git.
>
> The thing is, to some degree it's really fundamentally hard.
>
> It's easy for a linear history. What you do for a linear history is to
> just get the top commit, and the tree associated with it, and then you
> cauterize the parent by just grafting it to go away. Boom. You're done.
>
> The problem is that if the preceding history _wasn't_ linear (or, in
> fact, _subsequent_ development refers to it by having branched off at an
> earlier point), and you try to pull your updates, the other end (that
> knows about all the history) will assume you have all the history that you
> don't have, and will send you a pack assuming that.
Couldn't it be solved by enhancing the initial handshake to send, from the
puller (object receiver) to the pullee (object sender), the contents of the
graft file - or better, only the cauterizing entries? Short of splitting
the graft file, we would at least want an option controlling whether it is
sent, since a graft file can also be used to join a historical line of
development rather than to cauterize history.
Then the sender would use the received cauterizing grafts _only_ for
calculating which objects to send, cauterizing its own history "in memory".
The main disadvantage is that if one cauterized history too eagerly, the
shallow clone's history can lack merge bases, with no way to get them
_simply_ using this approach...
Now I guess you would tell me why this very simple idea is stupid...
--
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Figured out how to get Mozilla into git
2006-06-10 8:21 ` Jakub Narebski
@ 2006-06-10 9:00 ` Junio C Hamano
0 siblings, 0 replies; 69+ messages in thread
From: Junio C Hamano @ 2006-06-10 9:00 UTC (permalink / raw)
To: Jakub Narebski; +Cc: git
Jakub Narebski <jnareb@gmail.com> writes:
> Couldn't it be solved by enhancing the initial handshake to send, from the
> puller (object receiver) to the pullee (object sender), the contents of the
> graft file - or better, only the cauterizing entries? Short of splitting
> the graft file, we would at least want an option controlling whether it is
> sent, since a graft file can also be used to join a historical line of
> development rather than to cauterize history.
>
> Then the sender would use the received cauterizing grafts _only_ for
> calculating which objects to send, cauterizing its own history "in memory".
>
> Now I guess you would tell me why this very simple idea is stupid...
It is not stupid at all; what you said is actually on the right
track. You indeed just reinvented half of what I've outlined
earlier for implementing shallow clone (the other half you
missed is that the graft exchange needs to happen both ways,
limiting the commit ancestry graph both ends walk to the
intersection of the fake views of the ancestry graph the two
ends have, but that is a minor detail).
The problem is that what Linus described as "fundamentally hard"
is not the initial "shallow clone" stage, but lies elsewhere.
Namely, what to do after you create such a shallow clone and
when you want to unplug an earlier cauterization point.
In order to unplug a cauterization point (a commit we faked to
be parentless earlier, whose parents and associated objects we
ought to have but we do not because we made a shallow clone),
the downloader needs to re-fetch that commit while temporarily
pretending that it does not have any objects that are newer,
perhaps defining another earlier point as a new cauterization
point at the same time. Git format allows for that, and the
protocol exchange certainly can be extensible to support
something like that, but the design work would be quite
involved.
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Figured out how to get Mozilla into git
2006-06-10 3:08 ` Linus Torvalds
2006-06-10 8:21 ` Jakub Narebski
@ 2006-06-10 8:36 ` Rogan Dawes
2006-06-10 9:08 ` Junio C Hamano
2006-06-10 17:53 ` Linus Torvalds
1 sibling, 2 replies; 69+ messages in thread
From: Rogan Dawes @ 2006-06-10 8:36 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Jon Smirl, Martin Langhoff, git
Linus Torvalds wrote:
>
> On Fri, 9 Jun 2006, Carl Worth wrote:
>
>> On Fri, 9 Jun 2006 22:21:17 -0400, "Jon Smirl" wrote:
>>> Could you clone the repo and delete changesets earlier than 2004? Then
>>> I would clone the small repo and work with it. Later I decide I want
>>> full history, can I pull from a full repository at that point and get
>>> updated? That would need a flag to trigger it since I don't want full
>>> history to come over if I am just getting updates from someone else's
>>> tree that has a full history.
>> This is clearly a desirable feature, and has been requested by several
>> people (including myself) looking to switch some large-ish histories
>> from an existing system to git.
>
> The thing is, to some degree it's really fundamentally hard.
>
> It's easy for a linear history. What you do for a linear history is to
> just get the top commit, and the tree associated with it, and then you
> cauterize the parent by just grafting it to go away. Boom. You're done.
>
> The problem is that if the preceding history _wasn't_ linear (or, in
> fact, _subsequent_ development refers to it by having branched off at an
> earlier point), and you try to pull your updates, the other end (that
> knows about all the history) will assume you have all the history that you
> don't have, and will send you a pack assuming that.
>
> Which won't even necessarily have all the tree/blob objects (it assumed
> you already had them), but more annoyingly, the history won't be
> cauterized, and you'll have dangling commits. Which you can cauterize by
> hand, of course, but you literally _will_ have to get the objects and
> cauterize the thing by hand.
>
> You're right that it's not "fundamentally impossible" to do: the git
> format certainly _allows_ it. But the git protocol handshake really does
> end up optimizing away all the unnecessary work by knowing that the other
> side will have all the shared history, so lacking the shared history will
> mean that you're a bit screwed.
Here's an idea. How about separating trees and commits from the actual
blobs (e.g. in separate packs)? My reasoning is that the commits and
trees should only be a small portion of the overall repository size, and
should not be that expensive to transfer. (Of course, this is only a
guess, and needs some numbers to back it up.)
So, a shallow clone would receive all of the tree objects, and all of
the commit objects, and could then request a pack containing the blobs
represented by the current HEAD.
In this way, the user has a history that will show all of the commit
messages, and would be able to see _which_ files have changed over time
e.g. gitk would still work - except for the actual file level diff, "git
log" should also still work, etc
This would also enable other optimisations.
For example, documentation people would only need to get the objects
under the doc/ tree, and would not need to actually check out the
source. Git could detect any actual changes by checking whether it has
the previous blob in its local repository, and whether the file exists
locally. Creating a patch would obviously require that the person checks
out the previous version, but one could theoretically commit a new blob
to a repo without having the previous one (not saying that this would be
a good idea, of course)
This would probably require Eric Biederman's "direct access to blob"
patches, I guess, in order to be feasible.
Regards,
Rogan
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Figured out how to get Mozilla into git
2006-06-10 8:36 ` Rogan Dawes
@ 2006-06-10 9:08 ` Junio C Hamano
2006-06-10 14:47 ` Rogan Dawes
2006-06-10 17:53 ` Linus Torvalds
1 sibling, 1 reply; 69+ messages in thread
From: Junio C Hamano @ 2006-06-10 9:08 UTC (permalink / raw)
To: Rogan Dawes; +Cc: git
Rogan Dawes <lists@dawes.za.net> writes:
> Here's an idea. How about separating trees and commits from the actual
> blobs (e.g. in separate packs)?
If I remember my numbers correctly, trees for any project with a
size that matters contribute a non-negligible amount of the total
pack weight. Perhaps 10-25%.
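Easy enough to measure on a packed repo; git-verify-pack -v prints one
object per line, and assuming type and size are the 2nd and 3rd columns,
something like this would give the split (a sketch, untested):

    git-verify-pack -v .git/objects/pack/pack-*.idx |
    awk '$2 == "tree" { t += $3 } $2 == "blob" { b += $3 }
         END { printf "trees %d, blobs %d (%.0f%% trees)\n", t, b, 100*t/(t+b) }'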
> In this way, the user has a history that will show all of the commit
> messages, and would be able to see _which_ files have changed over
> time e.g. gitk would still work - except for the actual file level
> diff, "git log" should also still work, etc
I suspect it would make a very unpleasant system to use.
Sometimes "git diff -p" would show diffs, and other times it would
mysteriously complain that it lacks the necessary blobs to do its
job. You cannot even run fsck and tell from its output
which missing objects are OK (because you chose to create such a
sparse repository) and which are real corruption.
A shallow clone with explicit cauterization in grafts file at
least would not have that problem. Although the user will still
not see the exact same result as what would happen in a full
repository, at least we can say "your git log ends at that
commit because your copy of the history does not go back beyond
that" and the user would understand.
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Figured out how to get Mozilla into git
2006-06-10 9:08 ` Junio C Hamano
@ 2006-06-10 14:47 ` Rogan Dawes
2006-06-10 14:58 ` Jakub Narebski
2006-06-10 15:14 ` Nicolas Pitre
0 siblings, 2 replies; 69+ messages in thread
From: Rogan Dawes @ 2006-06-10 14:47 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
Junio C Hamano wrote:
> Rogan Dawes <lists@dawes.za.net> writes:
>
>> Here's an idea. How about separating trees and commits from the actual
>> blobs (e.g. in separate packs)?
>
> If I remember my numbers correctly, trees for any project with a
> size that matters contribute a non-negligible amount of the total
> pack weight. Perhaps 10-25%.
Out of curiosity, do you think that it may be possible for tree objects
to compress more/better if they are packed together? Or does the
existing pack compression logic already do the diff against similar tree
objects?
>> In this way, the user has a history that will show all of the commit
>> messages, and would be able to see _which_ files have changed over
>> time e.g. gitk would still work - except for the actual file level
>> diff, "git log" should also still work, etc
>
> I suspect it would make a very unpleasant system to use.
> Sometimes "git diff -p" would show diffs, and other times it would
> mysteriously complain that it lacks the necessary blobs to do its
> job. You cannot even run fsck and tell from its output
> which missing objects are OK (because you chose to create such a
> sparse repository) and which are real corruption.
The fsck problem could be worked around by maintaining a list of objects
that are explicitly not expected to be present. As diffs are performed,
other parts of the blob history are retrieved, etc, the list will get
shorter until we have a complete clone of the original tree.
Of course diffs against a version further back in the history would
fail. But if you start with a checkout of a complete tree, any changes
made since that point would at least have one version to compare against.
In effect, what we would have is a caching repository (or as Jakub said,
a lazy clone). An initial checkout would effectively be pre-seeding the
cache. One does not necessarily even need to get the complete set of
commit and tree objects, either. The bare minimum would probably be to
get the HEAD commit, and the tree objects that correspond to that commit.
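Both are easy to enumerate with existing plumbing, e.g.:

    git-cat-file commit HEAD    # the commit object itself
    git-ls-tree -r -t HEAD      # every tree and blob it references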
At that point, one could populate the "uncached objects" list with the
parent commits. One would not be in a position to get any history at
all, of course.
As the user performs various operations, e.g. git log, git could either
go and fetch the necessary objects (updating the uncached list as it
goes), or fail with a message such as "Cannot perform the requested
operation - required objects are not available". (We may require another
utility that would list the objects required for an operation, and
compare it against the list of "uncached objects", printing out a list
of which are not yet available locally. I realise that this may be
expensive. Maybe a repo configuration option "cached" to enable or
disable this.)
As Jakub suggested, it would be necessary to configure the location of
the source for any missing objects, but that is probably in the repo
config anyway.
> A shallow clone with explicit cauterization in grafts file at
> least would not have that problem. Although the user will still
> not see the exact same result as what would happen in a full
> repository, at least we can say "your git log ends at that
> commit because your copy of the history does not go back beyond
> that" and the user would understand.
Or, we could say, perform the operation while you are online, and can
access the necessary objects. If the user has explicitly chosen to make
a lazy clone, then they should expect that at some point, whatever they
do may require them to be online to access items that they have not yet
cloned.
Rogan
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Figured out how to get Mozilla into git
2006-06-10 14:47 ` Rogan Dawes
@ 2006-06-10 14:58 ` Jakub Narebski
2006-06-10 15:14 ` Nicolas Pitre
1 sibling, 0 replies; 69+ messages in thread
From: Jakub Narebski @ 2006-06-10 14:58 UTC (permalink / raw)
To: git
Rogan Dawes wrote:
> Junio C Hamano wrote:
>> Rogan Dawes <lists@dawes.za.net> writes:
>>
>>> Here's an idea. How about separating trees and commits from the actual
>>> blobs (e.g. in separate packs)?
>>
>> If I remember my numbers correctly, trees for any project with a
>> size that matters contribute a non-negligible amount of the total
>> pack weight. Perhaps 10-25%.
>
> Out of curiosity, do you think that it may be possible for tree objects
> to compress more/better if they are packed together? Or does the
> existing pack compression logic already do the diff against similar tree
> objects?
The problem with compressing and deltifying trees is the sha1 object
identifiers, I guess.
--
Jakub Narebski
Warsaw, Poland
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Figured out how to get Mozilla into git
2006-06-10 14:47 ` Rogan Dawes
2006-06-10 14:58 ` Jakub Narebski
@ 2006-06-10 15:14 ` Nicolas Pitre
1 sibling, 0 replies; 69+ messages in thread
From: Nicolas Pitre @ 2006-06-10 15:14 UTC (permalink / raw)
To: Rogan Dawes; +Cc: Junio C Hamano, git
On Sat, 10 Jun 2006, Rogan Dawes wrote:
> Out of curiosity, do you think that it may be possible for tree objects to
> compress more/better if they are packed together? Or does the existing pack
> compression logic already do the diff against similar tree objects?
Tree objects for the same directories are already packed and deltified
against each other in a pack.
Nicolas
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Figured out how to get Mozilla into git
2006-06-10 8:36 ` Rogan Dawes
2006-06-10 9:08 ` Junio C Hamano
@ 2006-06-10 17:53 ` Linus Torvalds
2006-06-10 18:02 ` Jon Smirl
2006-06-10 18:36 ` Rogan Dawes
1 sibling, 2 replies; 69+ messages in thread
From: Linus Torvalds @ 2006-06-10 17:53 UTC (permalink / raw)
To: Rogan Dawes; +Cc: Jon Smirl, Martin Langhoff, git
On Sat, 10 Jun 2006, Rogan Dawes wrote:
>
> Here's an idea. How about separating trees and commits from the actual blobs
> (e.g. in separate packs)? My reasoning is that the commits and trees should
> only be a small portion of the overall repository size, and should not be that
> expensive to transfer. (Of course, this is only a guess, and needs some
> numbers to back it up.)
The trees in particular are actually a pretty big part of the history.
More importantly, the blobs compress horribly badly in the absence of
history - a _lot_ of the compression in git packing comes very much from
the fact that we do a good job at delta-compression.
So if you get all of the commit/tree history, but none of the blob
history, you're actually not going to win that much space. As already
discussed, the _whole_ history packed with git is usually not insanely
bigger than just the whole unpacked tree (with no history at all).
So you'd think that getting just the top version of the tree would be a
much bigger space-saving than it actually is. If you _also_ get all the
tree and commit objects, the space saving is even less.
I actually suspect that the most realistic way to handle this is to use
the "fetch.c" logic (ie the incremental fetcher used by http), and add
some mode to the git daemon where you fetch literally one object at a time
(ie this would be totally _separate_ from the pack-file thing: you'd not
ask for "git-upload-pack", you'd ask for something like
"git-serve-objects" instead).
The fetch.c logic really does allow for on-demand object fetching, and is
thus much more suitable for incomplete repositories.
HOWEVER. The fetch.c logic - by necessity - works on a object-by-object
level. That means that you'd get no delta compression AT ALL, and I
suspect that the downside of that would be a factor of ten expansion or
more, which means that it would really not work that well in practice.
It might be worth testing, though. It would work fine for the "after I
have the initial cauterized tree, fetch small incremental updates" case.
The operative word here being "small" and "incremental", because I'm
pretty sure it really would suck for the case of a big fetch.
But it would be _simple_, which is why it's worth trying out. It also has
the advantage that it would solve the "I had data corruption on my disk,
and lost 100 objects, but all the rest is fine" issue. Again, that's
not something that the efficient packing protocol handles, exactly because
it assumes full history, and uses that to do all its optimizations.
Linus
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Figured out how to get Mozilla into git
2006-06-10 17:53 ` Linus Torvalds
@ 2006-06-10 18:02 ` Jon Smirl
2006-06-10 18:36 ` Rogan Dawes
1 sibling, 0 replies; 69+ messages in thread
From: Jon Smirl @ 2006-06-10 18:02 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Rogan Dawes, Martin Langhoff, git
Here's a random idea: how about a tool that turns a real pack into one
that is segmented and then faults in segments if you do an operation
that needs the old segments? The full pack would always look like it
is there even if it isn't. Something like gitk would be modified not
to fault in the missing segments.
--
Jon Smirl
jonsmirl@gmail.com
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Figured out how to get Mozilla into git
2006-06-10 17:53 ` Linus Torvalds
2006-06-10 18:02 ` Jon Smirl
@ 2006-06-10 18:36 ` Rogan Dawes
1 sibling, 0 replies; 69+ messages in thread
From: Rogan Dawes @ 2006-06-10 18:36 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Jon Smirl, Martin Langhoff, git
Linus Torvalds wrote:
>
> On Sat, 10 Jun 2006, Rogan Dawes wrote:
>> Here's an idea. How about separating trees and commits from the actual blobs
>> (e.g. in separate packs)? My reasoning is that the commits and trees should
>> only be a small portion of the overall repository size, and should not be that
>> expensive to transfer. (Of course, this is only a guess, and needs some
>> numbers to back it up.)
>
> The trees in particular are actually a pretty big part of the history.
>
> More importantly, the blobs compress horribly badly in the absence of
> history - a _lot_ of the compression in git packing comes very much from
> the fact that we do a good job at delta-compression.
>
> So if you get all of the commit/tree history, but none of the blob
> history, you're actually not going to win that much space. As already
> discussed, the _whole_ history packed with git is usually not insanely
> bigger than just the whole unpacked tree (with no history at all).
>
> So you'd think that getting just the top version of the tree would be a
> much bigger space-saving than it actually is. If you _also_ get all the
> tree and commit objects, the space saving is even less.
>
One possibility, given that the full commit and tree history is so
large, is simply to get the HEAD commit and the trees that the commit
depends directly on, rather than fetching them all up front.
> I actually suspect that the most realistic way to handle this is to use
> the "fetch.c" logic (ie the incremental fetcher used by http), and add
> some mode to the git daemon where you fetch literally one object at a time
> (ie this would be totally _separate_ from the pack-file thing: you'd not
> ask for "git-upload-pack", you'd ask for something like
> "git-serve-objects" instead).
>
> The fetch.c logic really does allow for on-demand object fetching, and is
> thus much more suitable for incomplete repositories.
>
> HOWEVER. The fetch.c logic - by necessity - works on a object-by-object
> level. That means that you'd get no delta compression AT ALL, and I
> suspect that the downside of that would be a factor of ten expansion or
> more, which means that it would really not work that well in practice.
Would it be possible to add a mode where fetch.c is given a list of
desired objects, and returns a list of pointers to those objects? Then
callers that already have such a list could be modified to pass the
whole list at once, allowing at least SOME compression, and optimisation
of round trips, etc? There would be a tradeoff in memory use, though, I
guess.
Rogan
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Figured out how to get Mozilla into git
2006-06-10 2:21 ` Jon Smirl
2006-06-10 2:34 ` Carl Worth
@ 2006-06-10 3:01 ` Linus Torvalds
1 sibling, 0 replies; 69+ messages in thread
From: Linus Torvalds @ 2006-06-10 3:01 UTC (permalink / raw)
To: Jon Smirl; +Cc: Martin Langhoff, git
On Fri, 9 Jun 2006, Jon Smirl wrote:
>
> No more cvs diff taking four minutes to finish. I have to do that
> every time I want to generate a 10 line patch. Diffs can run locally.
> No more cvs update to replace files I deleted because I messed up
> edits in them. And I can have local branches, yeah!
More importantly, when the CVS server is down (can you say
"sourceforge"?), who cares?
> What are we going to do about the BeOS developers on Mozilla? There
> are a couple more obscure OSes.
Well, the git cvsserver exporter apparently works well enough...
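From memory, the client side is roughly

    export CVS_SERVER=git-cvsserver
    cvs -d :ext:user@host:/path/to/repo.git co -d project master

(a sketch - the exact incantation may differ).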
Linus
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Figured out how to get Mozilla into git
2006-06-10 1:59 ` Linus Torvalds
2006-06-10 2:21 ` Jon Smirl
@ 2006-06-10 2:30 ` Jon Smirl
2006-06-10 3:41 ` Martin Langhoff
2 siblings, 0 replies; 69+ messages in thread
From: Jon Smirl @ 2006-06-10 2:30 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Martin Langhoff, git
On 6/9/06, Linus Torvalds <torvalds@osdl.org> wrote:
>
>
> On Fri, 9 Jun 2006, Jon Smirl wrote:
> > >
> > > Btw, does anybody know roughly how much data an initial "cvs co" takes on
> > > the mozilla repo? Git will obviously get the whole history, and that will
> > > inevitably be bigger than getting a single check-out, but it's not
> > > necessarily orders of magnitude bigger.
> >
> > 339MB for initial checkout
I ran the checkout through bzip2 and it is 36.4MB, 46.4MB with zip.
So the ratio may be 15 to 1 for a compressed cvs co vs. a full git clone.
> And I think people run :pserver: with compression by default, so we're
> likely talking about half that in actual download overhead, no?
>
> So a git clone would be about (wild handwaving, don't look at all the
> assumptions) four times as expensive - assuming we only look at a poor DSL
> line as the expense - as an initial CVS co, but you'd get the _whole_
> history. Which may or may not make up for it. For some people it will, for
> others it won't.
>
> Of course, to make up for some of the initial costs, I suspect that for
> people who are used to "cvs update" taking 15 minutes to update two files,
> it would be a serious relief to see the git kind of "300 objects in five
> seconds" pulls.
>
> Although I guess that's one of the CVS things that SVN improved on. At
> least I'd hope so ;/
>
> Linus
>
--
Jon Smirl
jonsmirl@gmail.com
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Figured out how to get Mozilla into git
2006-06-10 1:59 ` Linus Torvalds
2006-06-10 2:21 ` Jon Smirl
2006-06-10 2:30 ` Jon Smirl
@ 2006-06-10 3:41 ` Martin Langhoff
2006-06-10 3:55 ` Junio C Hamano
2006-06-10 4:02 ` Linus Torvalds
2 siblings, 2 replies; 69+ messages in thread
From: Martin Langhoff @ 2006-06-10 3:41 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Jon Smirl, git
On 6/10/06, Linus Torvalds <torvalds@osdl.org> wrote:
> On Fri, 9 Jun 2006, Jon Smirl wrote:
> > >
> > > Btw, does anybody know roughly how much data an initial "cvs co" takes on
> > > the mozilla repo? Git will obviously get the whole history, and that will
> > > inevitably be bigger than getting a single check-out, but it's not
> > > necessarily orders of magnitude bigger.
> >
> > 339MB for initial checkout
>
> And I think people run :pserver: with compression by default, so we're
> likely talking about half that in actual download overhead, no?
Yes, most people have -z3, and I agree with you, on paper it sounds
like the cost is 1/4 of a git clone.
However.
The CVS protocol is very chatty because the client _acts_ extremely
stupid. It says, ok, I got here an empty directory, and the server
walks the client through every little step. And all that chatter is
uncompressed cleartext under pserver.
So the per-file and per-directory overheads are significant. I can do a
cvs checkout via pserver:localhost but I don't know off-the-cuff how
to measure the traffic. Hints?
cheers,
martin
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Figured out how to get Mozilla into git
2006-06-10 3:41 ` Martin Langhoff
@ 2006-06-10 3:55 ` Junio C Hamano
2006-06-10 4:02 ` Linus Torvalds
1 sibling, 0 replies; 69+ messages in thread
From: Junio C Hamano @ 2006-06-10 3:55 UTC (permalink / raw)
To: Martin Langhoff; +Cc: git
"Martin Langhoff" <martin.langhoff@gmail.com> writes:
> Yes, most people have -z3, and I agree with you, on paper it sounds
> like the cost is 1/4 of a git clone.
>
> However.
>
> The CVS protocol is very chatty because the client _acts_ extremely
> stupid. It says, ok, I got here an empty directory, and the server
> walks the client through every little step. And all that chatter is
> uncompressed cleartext under pserver.
>
> So the per-file and per-directory overheads are significant. I can do a
> cvs checkout via pserver:localhost but I don't know off-the-cuff how
> to measure the traffic. Hints?
If you have an otherwise unused interface, you can look at
ifconfig output and see RX/TX bytes? But that sounds very
crude.
Running it through a proxy perhaps?
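Or a pair of counting-only iptables rules (no -j target, so they only
bump counters) on the loopback interface - a sketch, assuming pserver on
its usual port 2401 and root access:

    iptables -I INPUT -i lo -p tcp --sport 2401   # server -> client
    iptables -I INPUT -i lo -p tcp --dport 2401   # client -> server
    # ... run the checkout, then read the per-rule byte counters:
    iptables -L INPUT -v -n
    # and clean up:
    iptables -D INPUT -i lo -p tcp --sport 2401
    iptables -D INPUT -i lo -p tcp --dport 2401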
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Figured out how to get Mozilla into git
2006-06-10 3:41 ` Martin Langhoff
2006-06-10 3:55 ` Junio C Hamano
@ 2006-06-10 4:02 ` Linus Torvalds
2006-06-10 4:11 ` Linus Torvalds
1 sibling, 1 reply; 69+ messages in thread
From: Linus Torvalds @ 2006-06-10 4:02 UTC (permalink / raw)
To: Martin Langhoff; +Cc: Jon Smirl, git
On Sat, 10 Jun 2006, Martin Langhoff wrote:
>
> So the per-file and per-directory overhead are significant. I can do a
> cvs checkout via pserver:localhost but I don't know off-the-cuff how
> to measure the traffic. Hints?
Over localhost, you won't see the biggest issue, which is just latency.
The git protocol should be absolutely _wonderful_ with bad latency,
because once the early back-and-forth on what each side has is done,
there's no synchronization any more - it's all just streaming, with
full-frame TCP.
If :pserver: does per-file "hey, what are you up to" kind of
synchronization, the big killer would be the latency from one end to the
other, regardless of any throughput.
You can try to approximate the latency by just looking at the number of
packets, and using a large MTU (and on localhost, the MTU will be pretty
large - roughly 16kB). Don't count packet size at all, just count how many
packets each protocol sends (both ways), ignoring packets that are just
empty ACKs.
I don't know how to build a tcpdump expression for "TCP packet with an
empty payload", but I bet it's possible.
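Something like this might do it - a sketch, untested, using the usual
header-length arithmetic (IP total length minus IP and TCP header
lengths) and pserver's standard port 2401:

    # packets that actually carry payload (flip != to == for empty ACKs):
    tcpdump -nn -i lo 'tcp port 2401 and
      (ip[2:2] - ((ip[0]&0xf)<<2) - ((tcp[12]&0xf0)>>2)) != 0' | wc -l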
[ And I won't guarantee that it's a wonderful approximation for "network
cost", but I think it's potentially a reasonably good one. It's totally
realistic to equate 32kB of _streaming_ data (two packets flowing in
one direction with no synchronization) with just a single byte of data
going back-and-forth synchronously ]
Linus
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Figured out how to get Mozilla into git
2006-06-10 4:02 ` Linus Torvalds
@ 2006-06-10 4:11 ` Linus Torvalds
2006-06-10 6:02 ` Jon Smirl
0 siblings, 1 reply; 69+ messages in thread
From: Linus Torvalds @ 2006-06-10 4:11 UTC (permalink / raw)
To: Martin Langhoff; +Cc: Jon Smirl, git
On Fri, 9 Jun 2006, Linus Torvalds wrote:
>
> You can try to approximate the latency by just looking at the number of
> packets, and using a large MTU (and on localhost, the MTU will be pretty
> large - roughly 16kB). Don't count packet size at all, just count how many
> packets each protocol sends (both ways), ignoring packets that are just
> empty ACKs.
Btw, the reason you should ignore empty acks is that they happen when you
have a nice streaming one-way thing, because the TCP rules say that you
should send an ACK every two full packets minimum, even if you have
nothing to say.
So empty acks really approximate to "streaming data", while packets with
payload _could_ obviously mean "nice streaming data going both ways", but
almost always end up being synchronization discussion of some sort.
Linus
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Figured out how to get Mozilla into git
2006-06-10 4:11 ` Linus Torvalds
@ 2006-06-10 6:02 ` Jon Smirl
2006-06-10 6:15 ` Junio C Hamano
0 siblings, 1 reply; 69+ messages in thread
From: Jon Smirl @ 2006-06-10 6:02 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Martin Langhoff, git
Here's a new transport problem. When using git-clone to fetch Martin's
tree it kept failing for me at dreamhost. I had a parallel fetch
running on my local machine which has a much slower net connection. It
finally finished and I am watching the end phase where it prints all
of the 'walk' messages. The git-http-fetch process has jumped up to
800MB in size after being 2MB during the download. dreamhost has a
500MB process size limit so that is why my fetches kept failing there.
--
Jon Smirl
jonsmirl@gmail.com
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Figured out how to get Mozilla into git
2006-06-10 6:02 ` Jon Smirl
@ 2006-06-10 6:15 ` Junio C Hamano
2006-06-10 15:44 ` Jon Smirl
0 siblings, 1 reply; 69+ messages in thread
From: Junio C Hamano @ 2006-06-10 6:15 UTC (permalink / raw)
To: Jon Smirl; +Cc: git
"Jon Smirl" <jonsmirl@gmail.com> writes:
> Here's a new transport problem. When using git-clone to fetch Martin's
> tree it kept failing for me at dreamhost. I had a parallel fetch
> running on my local machine which has a much slower net connection. It
> finally finished and I am watching the end phase where it prints all
> of the 'walk' messages. The git-http-fetch process has jumped up to
> 800MB in size after being 2MB during the download. dreamhost has a
> 500MB process size limit so that is why my fetches kept failing there.
The http-fetch process works by mmaping the downloaded pack, and
if I recall correctly we are talking about a 600MB pack, so a 500MB
limit sounds impossible, perhaps?
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Figured out how to get Mozilla into git
2006-06-10 6:15 ` Junio C Hamano
@ 2006-06-10 15:44 ` Jon Smirl
2006-06-10 16:15 ` Timo Hirvonen
` (2 more replies)
0 siblings, 3 replies; 69+ messages in thread
From: Jon Smirl @ 2006-06-10 15:44 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On 6/10/06, Junio C Hamano <junkio@cox.net> wrote:
> "Jon Smirl" <jonsmirl@gmail.com> writes:
>
> > Here's a new transport problem. When using git-clone to fetch Martin's
> > tree it kept failing for me at dreamhost. I had a parallel fetch
> > running on my local machine which has a much slower net connection. It
> > finally finished and I am watching the end phase where it prints all
> > of the 'walk' messages. The git-http-fetch process has jumped up to
> > 800MB in size after being 2MB during the download. dreamhost has a
> > 500MB process size limit so that is why my fetches kept failing there.
>
> The http-fetch process works by mmaping the downloaded pack, and
> if I recall correctly we are talking about a 600MB pack, so a 500MB
> limit sounds impossible, perhaps?
The fetch on my local machine failed too. It left nothing behind, now
I have to download the 680MB again.
walk 1f19465388a4ef7aff7527a13f16122a809487d4
walk c3ca840256e3767d08c649f8d2761a1a887351ab
walk 7a74e42699320c02b814b88beadb1ae65009e745
error: Couldn't get
http://mirrors.catalyst.net.nz/pub/mozilla.git//refs/tags/JS%5F1%5F7%5FALPHA%5FBASE
for tags/JS_1_7_ALPHA_BASE
Couldn't resolve host 'mirrors.catalyst.net.nz'
error: Could not interpret tags/JS_1_7_ALPHA_BASE as something to pull
[jonsmirl@jonsmirl mozgit]$ cg update
There is no GIT repository here (.git not found)
[jonsmirl@jonsmirl mozgit]$ ls -a
. ..
[jonsmirl@jonsmirl mozgit]$
--
Jon Smirl
jonsmirl@gmail.com
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Figured out how to get Mozilla into git
2006-06-10 15:44 ` Jon Smirl
@ 2006-06-10 16:15 ` Timo Hirvonen
2006-06-10 18:37 ` Petr Baudis
2006-06-10 18:55 ` Lars Johannsen
2 siblings, 0 replies; 69+ messages in thread
From: Timo Hirvonen @ 2006-06-10 16:15 UTC (permalink / raw)
To: Jon Smirl; +Cc: junkio, git
"Jon Smirl" <jonsmirl@gmail.com> wrote:
> On 6/10/06, Junio C Hamano <junkio@cox.net> wrote:
> > "Jon Smirl" <jonsmirl@gmail.com> writes:
> >
> > > Here's a new transport problem. When using git-clone to fetch Martin's
> > > tree it kept failing for me at dreamhost. I had a parallel fetch
> > > running on my local machine which has a much slower net connection. It
> > > finally finished and I am watching the end phase where it prints all
> > > of the 'walk' messages. The git-http-fetch process has jumped up to
> > > 800MB in size after being 2MB during the download. dreamhost has a
> > > 500MB process size limit so that is why my fetches kept failing there.
> >
> > The http-fetch process works by mmaping the downloaded pack, and
> > if I recall correctly we are talking about a 600MB pack, so a 500MB
> > limit sounds impossible, perhaps?
>
> The fetch on my local machine failed too. It left nothing behind, now
> I have to download the 680MB again.
That's sad. Could git-clone be changed to not remove the .git directory if
fetching objects fails (after other files in the .git directory have
been fetched)? You could then hopefully continue with git-pull.
--
http://onion.dynserv.net/~timo/
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Figured out how to get Mozilla into git
2006-06-10 15:44 ` Jon Smirl
2006-06-10 16:15 ` Timo Hirvonen
@ 2006-06-10 18:37 ` Petr Baudis
2006-06-10 18:55 ` Lars Johannsen
2 siblings, 0 replies; 69+ messages in thread
From: Petr Baudis @ 2006-06-10 18:37 UTC (permalink / raw)
To: Jon Smirl; +Cc: Junio C Hamano, git
Dear diary, on Sat, Jun 10, 2006 at 05:44:58PM CEST, I got a letter
where Jon Smirl <jonsmirl@gmail.com> said that...
> The fetch on my local machine failed too. It left nothing behind, now
> I have to download the 680MB again.
>
> walk 1f19465388a4ef7aff7527a13f16122a809487d4
> walk c3ca840256e3767d08c649f8d2761a1a887351ab
> walk 7a74e42699320c02b814b88beadb1ae65009e745
> error: Couldn't get
> http://mirrors.catalyst.net.nz/pub/mozilla.git//refs/tags/JS%5F1%5F7%5FALPHA%5FBASE
> for tags/JS_1_7_ALPHA_BASE
> Couldn't resolve host 'mirrors.catalyst.net.nz'
> error: Could not interpret tags/JS_1_7_ALPHA_BASE as something to pull
> [jonsmirl@jonsmirl mozgit]$ cg update
> There is no GIT repository here (.git not found)
> [jonsmirl@jonsmirl mozgit]$ ls -a
> . ..
> [jonsmirl@jonsmirl mozgit]$
You could try with cg-clone, which won't delete the repository if
things fail. It will clone only the master branch, though.
--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
A person is just about as big as the things that make them angry.
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Figured out how to get Mozilla into git
2006-06-10 15:44 ` Jon Smirl
2006-06-10 16:15 ` Timo Hirvonen
2006-06-10 18:37 ` Petr Baudis
@ 2006-06-10 18:55 ` Lars Johannsen
2 siblings, 0 replies; 69+ messages in thread
From: Lars Johannsen @ 2006-06-10 18:55 UTC (permalink / raw)
To: Jon Smirl; +Cc: git
On (10/06/06 11:44), Jon Smirl wrote:
> Date: Sat, 10 Jun 2006 11:44:58 -0400
> From: "Jon Smirl" <jonsmirl@gmail.com>
> To: "Junio C Hamano" <junkio@cox.net>
> Subject: Re: Figured out how to get Mozilla into git
> Cc: git@vger.kernel.org
>
> On 6/10/06, Junio C Hamano <junkio@cox.net> wrote:
> >"Jon Smirl" <jonsmirl@gmail.com> writes:
> >
> >> Here's a new transport problem. When using git-clone to fetch Martin's
> >> tree it kept failing for me at dreamhost. I had a parallel fetch
> >> running on my local machine which has a much slower net connection. It
> >> finally finished and I am watching the end phase where it prints all
> >> of the 'walk' messages. The git-http-fetch process has jumped up to
> >> 800MB in size after being 2MB during the download. dreamhost has a
> >> 500MB process size limit so that is why my fetches kept failing there.
> >
> >The http-fetch process works by mmaping the downloaded pack, and
> >if I recall correctly we are talking about a 600MB pack, so a 500MB
> >limit sounds impossible, perhaps?
>
> The fetch on my local machine failed too. It left nothing behind, now
> I have to download the 680MB again.
>
> walk 1f19465388a4ef7aff7527a13f16122a809487d4
> walk c3ca840256e3767d08c649f8d2761a1a887351ab
> walk 7a74e42699320c02b814b88beadb1ae65009e745
> error: Couldn't get
> http://mirrors.catalyst.net.nz/pub/mozilla.git//refs/tags/JS%5F1%5F7%5FALPHA%5FBASE
> for tags/JS_1_7_ALPHA_BASE
> Couldn't resolve host 'mirrors.catalyst.net.nz'
> error: Could not interpret tags/JS_1_7_ALPHA_BASE as something to pull
> [jonsmirl@jonsmirl mozgit]$ cg update
> There is no GIT repository here (.git not found)
> [jonsmirl@jonsmirl mozgit]$ ls -a
> . ..
> [jonsmirl@jonsmirl mozgit]$
To prevent a repeat (on this repo) you could grab it with a browser:
- mkdir tmp; cd tmp; git init-db;
- copy mirror../pu/mozilla.git/objects/* to .git/objects/
- copy the same mirror's info/refs to refsinfo in the tmp dir
- gawk '{if ($2 !~ /\^\{\}$/) print $1 > sprintf(".git/%s",$2);}' refsinfo
  to extract branches and tags into .git/refs/{heads,tags}
Then start playing (after a backup) with git-fsck-objects, git-checkout etc.
--
Lars Johannsen
mail@Lars-johannsen.dk
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Figured out how to get Mozilla into git
2006-06-10 1:33 ` Linus Torvalds
2006-06-10 1:43 ` Linus Torvalds
@ 2006-06-11 22:00 ` Nicolas Pitre
2006-06-18 19:26 ` Linus Torvalds
1 sibling, 1 reply; 69+ messages in thread
From: Nicolas Pitre @ 2006-06-11 22:00 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Martin Langhoff, Jon Smirl, git
On Fri, 9 Jun 2006, Linus Torvalds wrote:
>
>
> On Sat, 10 Jun 2006, Martin Langhoff wrote:
> >
> > Now I don't know how much memory or time this took, but it clearly
> > completed ok. And, it's now a single pack, weighting a grand total of
> > 617MB
>
> Ok, that's more than reasonable. That should be fairly easily mapped on a
> 32-bit architecture without any huge problems, even with some VM
> fragmentation going on. It might be borderline (and you definitely want a
> 3:1 VM user:kernel split), but considering that the original CVS archive
> was apparently 3GB, having a single 617M pack-file is still pretty damn
> good. That's like 20% of the original, with all the obvious distribution
> advantages.
I played a bit with git-repack on that repo. The git-pack-objects
memory usage grew to around 760MB (git-rev-list was less than that). So
LRU of partial pack mappings might bring that down significantly.
Then I used git-repack -a -f --window=20 --depth=20 which produced a
nice 468MB pack file along with the invariant 45MB index file for a
grand total of 535MB for the whole repo (the .git/refs/ directory alone
still occupies 17MB on disk).
So it is probably worth having deeper delta chains for large historic
repositories as the deep revisions are unlikely to be referenced that
often while the saving is quite significant.
Nicolas
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Figured out how to get Mozilla into git
2006-06-11 22:00 ` Nicolas Pitre
@ 2006-06-18 19:26 ` Linus Torvalds
2006-06-18 21:40 ` Martin Langhoff
0 siblings, 1 reply; 69+ messages in thread
From: Linus Torvalds @ 2006-06-18 19:26 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Martin Langhoff, Jon Smirl, git
On Sun, 11 Jun 2006, Nicolas Pitre wrote:
>
> Then I used git-repack -a -f --window=20 --depth=20 which produced a
> nice 468MB pack file along with the invariant 45MB index file for a
> grand total of 535MB for the whole repo (the .git/refs/ directory alone
> still occupies 17MB on disk).
Btw, can others with that mozilla repo confirm that a mozilla repository
that has been repacked seems to be entirely fine, but git-fsck-objects
(with "--full", of course) will report
error: Packfile .git/objects/pack/pack-06389c21fc3c4312cbc9a4ddde087c907c1a840b.pack SHA1 mismatch with itself
for me (the fsck then completes with no other errors what-so-ever, so the
contents are actually fine).
Or is it just me?
Linus
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Figured out how to get Mozilla into git
2006-06-18 19:26 ` Linus Torvalds
@ 2006-06-18 21:40 ` Martin Langhoff
2006-06-18 22:36 ` Linus Torvalds
0 siblings, 1 reply; 69+ messages in thread
From: Martin Langhoff @ 2006-06-18 21:40 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Nicolas Pitre, Jon Smirl, git
On 6/19/06, Linus Torvalds <torvalds@osdl.org> wrote:
> Or is it just me?
No problems here with my latest import run. fsck-objects --full comes
clean, takes 14m:
/usr/bin/time git-fsck-objects --full
737.22user 38.79system 14:09.40elapsed 91%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (20807major+19483471minor)pagefaults 0swaps
BTW, that import (with the latest code Junio has) took 37hrs even with
the aggressive repack -a -d. I want to bench it dropping the -a from
the recurring repack, and doing a final repack -a -d.
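I.e. something like:

    git-repack -d        # cheap, after each imported chunk
    git-repack -a -d     # expensive, once at the very end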
cheers,
martin
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Figured out how to get Mozilla into git
2006-06-18 21:40 ` Martin Langhoff
@ 2006-06-18 22:36 ` Linus Torvalds
2006-06-18 22:51 ` Broken PPC sha1.. (Re: Figured out how to get Mozilla into git) Linus Torvalds
0 siblings, 1 reply; 69+ messages in thread
From: Linus Torvalds @ 2006-06-18 22:36 UTC (permalink / raw)
To: Martin Langhoff; +Cc: Nicolas Pitre, Jon Smirl, git
On Mon, 19 Jun 2006, Martin Langhoff wrote:
>
> No problems here with my latest import run. fsck-objects --full comes
> clean, takes 14m:
>
> /usr/bin/time git-fsck-objects --full
> 737.22user 38.79system 14:09.40elapsed 91%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (20807major+19483471minor)pagefaults 0swaps
It takes much less than that for me:
408.40user 32.56system 7:22.07elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (145major+13455672minor)pagefaults 0swaps
and in particular note the much lower minor pagefaults number (which is a
very good approximation of total RSS). Mine is with all the memory
optimizations in place, but I didn't see _that_ big of a difference, so
there's something else in addition.
However, the fact that I get "SHA1 mismatch with itself" is strange. The
re-pack will always re-generate the SHA1, so I worry that this is perhaps
some PPC-specific bug in SHA1 handling (and it's entirely possible that
it's triggered by doing a SHA1 over a 500+MB area).
The fact that you don't see it is indicative that it's somehow specific to
my setup.
> BTW, that import (with the latest code Junio has) took 37hrs even with
> the aggressive repack -a -d. I want to bench it dropping the -a from
> the recurring repack, and doing a final repack -a -d.
Yeah, that's probably the right thing to do. The "-a" is ok with tons of
memory, and I'm trying to make it ok with _less_ memory, but it's probably
just not worth it.
Linus
^ permalink raw reply [flat|nested] 69+ messages in thread
* Broken PPC sha1.. (Re: Figured out how to get Mozilla into git)
2006-06-18 22:36 ` Linus Torvalds
@ 2006-06-18 22:51 ` Linus Torvalds
2006-06-18 23:25 ` [PATCH] Fix PPC SHA1 routine for large input buffers Paul Mackerras
0 siblings, 1 reply; 69+ messages in thread
From: Linus Torvalds @ 2006-06-18 22:51 UTC (permalink / raw)
To: Martin Langhoff, Paul Mackerras; +Cc: Nicolas Pitre, Jon Smirl, git
On Sun, 18 Jun 2006, Linus Torvalds wrote:
>
> On Mon, 19 Jun 2006, Martin Langhoff wrote:
> >
> > No problems here with my latest import run. fsck-objects --full comes
> > clean, takes 14m:
> >
> > /usr/bin/time git-fsck-objects --full
> > 737.22user 38.79system 14:09.40elapsed 91%CPU (0avgtext+0avgdata 0maxresident)k
> > 0inputs+0outputs (20807major+19483471minor)pagefaults 0swaps
>
> It takes much less than that for me:
>
> 408.40user 32.56system 7:22.07elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (145major+13455672minor)pagefaults 0swaps
Ok, re-building the thing with MOZILLA_SHA1=1 rather than my default
PPC_SHA1=1 fixes the problem. I no longer get that "SHA1 mismatch with
itself" on the pack-file.
Sadly, it also takes a _lot_ longer to fsck.
Paul - I think the ppc SHA1_Update() overflows in 32 bits, when the length
of the memory area to be checksummed is huge.
In particular, the pack-file is 535MB in size, and the way we check the
SHA1 checksum is by just mapping it all, doing a single SHA1_Update() over
the whole pack-file, and comparing the end result with the internal SHA1
at the end of the pack-file.
The PPC SHA1_Update() function starts off with:
int SHA1_Update(SHA_CTX *c, const void *ptr, unsigned long n)
{
...
c->len += n << 3;
which will obviously overflow if "n" is bigger than 29 bits, ie 512MB.
So doing the length in bits (or whatever that "<<3" is there for) doesn't
seem to be such a great idea.
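For what it's worth, a tiny demo of the failure mode, assuming a 32-bit
"unsigned long":

    #include <stdio.h>
    int main(void)
    {
    	unsigned long n = 1UL << 29;	/* 512MB */
    	/* prints 0 on 32-bit, 4294967296 on 64-bit */
    	printf("%lu\n", n << 3);
    	return 0;
    }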
I guess we could make the caller just always chunk it up, but wouldn't it
be nice to fix the PPC SHA1 implementation instead?
That said, the _only_ thing this will ever trigger on in practice is
exactly this one case: a large packfile whose checksum was _correctly_
generated - because pack-file generation does it in IO chunks using the
csum-file interfaces - but that will be incorrectly checked because we
check it all at once.
So as bugs go, it's a fairly benign one.
Linus
^ permalink raw reply [flat|nested] 69+ messages in thread
* [PATCH] Fix PPC SHA1 routine for large input buffers
2006-06-18 22:51 ` Broken PPC sha1.. (Re: Figured out how to get Mozilla into git) Linus Torvalds
@ 2006-06-18 23:25 ` Paul Mackerras
2006-06-19 5:02 ` Linus Torvalds
0 siblings, 1 reply; 69+ messages in thread
From: Paul Mackerras @ 2006-06-18 23:25 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Martin Langhoff, Nicolas Pitre, Jon Smirl, git
The PPC SHA1 routine had an overflow which meant that it gave
incorrect results for input buffers >= 512MB. This fixes it by
ensuring that the update of the total length in bits is done using
64-bit arithmetic.
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
Linus Torvalds writes:
> Paul - I think the ppc SHA1_Update() overflows in 32 bits, when the length
> of the memory area to be checksummed is huge.
Yep. I checked the assembly output of this, and it looks right, but I
haven't actually tested it by running it...
Paul.
diff --git a/ppc/sha1.c b/ppc/sha1.c
index 5ba4fc5..0820398 100644
--- a/ppc/sha1.c
+++ b/ppc/sha1.c
@@ -30,7 +30,7 @@ int SHA1_Update(SHA_CTX *c, const void *
unsigned long nb;
const unsigned char *p = ptr;
- c->len += n << 3;
+ c->len += (uint64_t) n << 3;
while (n != 0) {
if (c->cnt || n < 64) {
nb = 64 - c->cnt;
^ permalink raw reply related [flat|nested] 69+ messages in thread
* Re: [PATCH] Fix PPC SHA1 routine for large input buffers
2006-06-18 23:25 ` [PATCH] Fix PPC SHA1 routine for large input buffers Paul Mackerras
@ 2006-06-19 5:02 ` Linus Torvalds
0 siblings, 0 replies; 69+ messages in thread
From: Linus Torvalds @ 2006-06-19 5:02 UTC (permalink / raw)
To: Paul Mackerras; +Cc: Martin Langhoff, Nicolas Pitre, Jon Smirl, git
On Mon, 19 Jun 2006, Paul Mackerras wrote:
>
> The PPC SHA1 routine had an overflow which meant that it gave
> incorrect results for input buffers >= 512MB. This fixes it by
> ensuring that the update of the total length in bits is done using
> 64-bit arithmetic.
>
> Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Linus Torvalds <torvalds@osdl.org>
This fixes git-fsck-objects for me on the mozilla archive - no more
complaints about bad SHA1s.

And yeah, now it's taking me 14 minutes too, so the 7-minute fsck was just
because it didn't actually check the SHA1 of the large pack fully.

(Which is actually good news - half of the time is literally spent checking
pack integrity. That implies the per-object integrity checks aren't as
dominant a cost as I thought they would be, and that things like
hw-accelerated SHA1 engines will help with fsck. I wouldn't be surprised to
see things like that in a couple of years.)
Linus
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Figured out how to get Mozilla into git
2006-06-09 2:17 Figured out how to get Mozilla into git Jon Smirl
2006-06-09 2:56 ` Nicolas Pitre
2006-06-09 3:06 ` Martin Langhoff
@ 2006-06-09 3:12 ` Pavel Roskin
2 siblings, 0 replies; 69+ messages in thread
From: Pavel Roskin @ 2006-06-09 3:12 UTC (permalink / raw)
To: Jon Smirl; +Cc: git
Hi Jon,
Quoting Jon Smirl <jonsmirl@gmail.com>:
> I was able to import Mozilla into SVN without problem, it just occured
> to me to then import the SVN repository in git.
I feel bad that I didn't suggest it before. That it worked is hardly
surprising: Subversion was created by CVS developers with the intention of
replacing CVS, and cvs2svn was written by those same developers, who paid
attention to all the CVS quirks. cvs2svn is quite mature, and it has a
testsuite, if I remember correctly.

My concern is how well a Subversion repository can be mapped to git,
considering that Subversion is branch-agnostic. But if it works for
Mozilla, this approach could be recommended for anything big and serious.
> The import has been
> running a few hours now and it is up to the year 2000 (starts in
> 1998). Since I haven't hit any errors yet it will probably finish ok.
> I should have the results in the morning. I wonder how long it will
> take to start gitk on a 10GB repository.
That's the "raison d'être" of qgit. I don't know of anything gitk has that
qgit doesn't, except bisecting.
> Once I get this monster into git, are there tools that will let me
> keep it in sync with Mozilla CVS?
Ideally, make Mozilla developers use git :-)
> SVN renamed numeric branches to this form, unlabeled-3.7.24, so that
> may be a problem.
I think git-svn is supposed to handle the svn->git part, but I'm afraid it
will need some work to do it efficiently. A Google search for "cvs2svn
incremental" turns up some patches. cvsup can be used to keep the CVS
repository itself synchronized.
--
Regards,
Pavel Roskin
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Broken PPC sha1.. (Re: Figured out how to get Mozilla into git)
@ 2006-06-19 8:41 linux
2006-06-19 8:50 ` Johannes Schindelin
0 siblings, 1 reply; 69+ messages in thread
From: linux @ 2006-06-19 8:41 UTC (permalink / raw)
To: git, torvalds; +Cc: linux
By the way, if anyone's still interested, I tried to produce a
better-scheduled PowerPC sha1 core last year. Unfortunately, I don't have
access to a PowerPC machine to test on, so debugging is a little painful.
The latest version is appended, if anyone with a PowerPC machine wants to
try it out. It should drop in for ppc/sha1ppc.S. Hopefully the comments
explain the general idea.
Notes I should add to the comments:
- When reading the assembly code, note that PPC assembly permits a bare
number (no %r prefix) as a register number, and further that number can
be an expression! It makes the register renaming nice and simple.
- Also for folks unfamiliar, it's dest,src,src operand order.
- For a reminder, the PowerPC calling convention is:
        %r0        - Temp. Reads as zero in some contexts.
        %r1        - Stack pointer.
        %r2        - Confusing. Different documents say different things.
        %r3..%r10  - Incoming arguments. Volatile across function calls.
        %r11..%r12 - Have some special uses not relevant here. Volatile.
        %r13..%r31 - Callee-save registers.
        %lr, %ctr  - Volatile. %lr holds the return address on input.

  And the way the registers are used in this function is:
        %r0        - Temp.
        %r1        - Stack pointer. Used only for register saving.
        %r2        - Not used.
        %r3        - Points to the hash accumulator A..E in memory.
        %r4        - Points to the data being hashed.
        %r5        - Incoming loop count. Holds the round constant K in
                     the body of the loop.
        %r6..%r10  - Working copies of A..E.
        %r11..%r26 - The W[] array of 16 input words being hashed.
        %r27..%r31 - Start-of-round copies of A..E.
        %ctr       - Holds the loop count, copied from the incoming %r5.
        %lr        - Holds the return address. Not modified.
- While I try to use the load/store multiple instructions where
  appropriate, they have a severe penalty for unaligned operands
  (they're microcoded optimistically, so they perform a full, failing
  aligned load before being re-issued as a slow-but-safe unaligned
  sequence). Thanks to git's object type prefix, the source data is
  generally unaligned, so they're deliberately NOT used to load the 16
  words of data hashed each iteration.
/*
* SHA-1 implementation for PowerPC.
*
* Copyright (C) 2005 Paul Mackerras <paulus@samba.org>
*/
/*
* We roll the registers for A, B, C, D, E around on each
* iteration; E on iteration t is D on iteration t+1, and so on.
* We use registers 6 - 10 for this. (Registers 27 - 31 hold
* the previous values.)
*/
#define RA(t) (((t)+4)%5+6)
#define RB(t) (((t)+3)%5+6)
#define RC(t) (((t)+2)%5+6)
#define RD(t) (((t)+1)%5+6)
#define RE(t) (((t)+0)%5+6)
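/*
 * Worked example (added note, not in the original posting):
 *   t=0: A=%r10 B=%r9  C=%r8 D=%r7 E=%r6
 *   t=1: A=%r6  B=%r10 C=%r9 D=%r8 E=%r7
 * so round 1's B..E are round 0's A..D, and the new A is computed
 * into the register that held round 0's E.
 */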
/* We use registers 11 - 26 for the W values */
#define W(t) ((t)%16+11)
/* Register 5 is used for the constant k */
/*
* There are three F functions, used in four groups of 20:
* - 20 rounds of f0(b,c,d) = "bit wise b ? c : d" = (^b & d) + (b & c)
* - 20 rounds of f1(b,c,d) = b^c^d = (b^d)^c
* - 20 rounds of f2(b,c,d) = majority(b,c,d) = (b&d) + ((b^d)&c)
* - 20 more rounds of f1(b,c,d)
*
* These are all scheduled for near-optimal performance on a G4.
* The G4 is a 3-issue out-of-order machine with 3 ALUs, but it can only
* *consider* starting the oldest 3 instructions per cycle. So to get
* maximum performance out of it, you have to treat it as an in-order
* machine. Which means interleaving the computation round t with the
* computation of W[t+4].
*
* The first 16 rounds use W values loaded directly from memory, while the
* remaining 64 use values computed from those first 16. We preload
* 4 values before starting, so there are three kinds of rounds:
* - The first 12 (all f0) also load the W values from memory.
* - The next 64 compute W(i+4) in parallel. 8*f0, 20*f1, 20*f2, 16*f1.
* - The last 4 (all f1) do not do anything with W.
*
* Therefore, we have 6 different round functions:
* STEPD0_LOAD(t,s) - Perform round t and load W(s). s < 16
* STEPD0_UPDATE(t,s) - Perform round t and compute W(s). s >= 16.
* STEPD1_UPDATE(t,s)
* STEPD2_UPDATE(t,s)
* STEPD1(t) - Perform round t with no load or update.
*
* The G5 is more fully out-of-order, and can find the parallelism
* by itself. The big limit is that it has a 2-cycle ALU latency, so
* even though it's 2-way, the code has to be scheduled as if it's
* 4-way, which can be a limit. To help it, we try to schedule the
* read of RA(t) as late as possible so it doesn't stall waiting for
* the previous round's RE(t-1), and we try to rotate RB(t) as early
* as possible while reading RC(t) (= RB(t-1)) as late as possible.
*/
/* the initial loads. */
#define LOADW(s) \
        lwz     W(s),(s)*4(%r4)
/*
* This is actually 13 instructions, which is an awkward fit,
* and uses W(s) as a temporary before loading it.
*/
#define STEPD0_LOAD(t,s) \
        add RE(t),RE(t),W(t);   andc %r0,RD(t),RB(t);   /* spare slot */        \
        add RE(t),RE(t),%r0;    and W(s),RC(t),RB(t);   rotlwi %r0,RA(t),5;     \
        add RE(t),RE(t),W(s);   add %r0,%r0,%r5;        rotlwi RB(t),RB(t),30;  \
        add RE(t),RE(t),%r0;    lwz W(s),(s)*4(%r4);
/*
* This can execute starting with 2 out of 3 possible moduli, so it
* does 2 rounds in 9 cycles, 4.5 cycles/round.
*/
#define STEPD0_UPDATE(t,s) \
        add RE(t),RE(t),W(t);   andc %r0,RD(t),RB(t);   xor W(s),W((s)-16),W((s)-3); \
        add RE(t),RE(t),%r0;    and %r0,RC(t),RB(t);    xor W(s),W(s),W((s)-8);      \
        add RE(t),RE(t),%r0;    rotlwi %r0,RA(t),5;     xor W(s),W(s),W((s)-14);     \
        add RE(t),RE(t),%r5;    rotlwi RB(t),RB(t),30;  rotlwi W(s),W(s),1;          \
        add RE(t),RE(t),%r0;
/* Nicely optimal. Conveniently, also the most common. */
#define STEPD1_UPDATE(t,s) \
        add RE(t),RE(t),W(t);   xor %r0,RD(t),RB(t);    xor W(s),W((s)-16),W((s)-3); \
        add RE(t),RE(t),%r5;    xor %r0,%r0,RC(t);      xor W(s),W(s),W((s)-8);      \
        add RE(t),RE(t),%r0;    rotlwi %r0,RA(t),5;     xor W(s),W(s),W((s)-14);     \
        add RE(t),RE(t),%r0;    rotlwi RB(t),RB(t),30;  rotlwi W(s),W(s),1;
/*
* The naked version, no UPDATE, for the last 4 rounds. 3 cycles per.
* We could use W(s) as a temp register, but we don't need it.
*/
#define STEPD1(t) \
        /* spare slot */        add RE(t),RE(t),W(t);   xor %r0,RD(t),RB(t);    \
        rotlwi RB(t),RB(t),30;  add RE(t),RE(t),%r5;    xor %r0,%r0,RC(t);      \
        add RE(t),RE(t),%r0;    rotlwi %r0,RA(t),5;     /* idle */              \
        add RE(t),RE(t),%r0;
/* 5 cycles per */
#define STEPD2_UPDATE(t,s) \
        add RE(t),RE(t),W(t);   and %r0,RD(t),RB(t);    xor W(s),W((s)-16),W((s)-3); \
        add RE(t),RE(t),%r0;    xor %r0,RD(t),RB(t);    xor W(s),W(s),W((s)-8);      \
        add RE(t),RE(t),%r5;    and %r0,%r0,RC(t);      xor W(s),W(s),W((s)-14);     \
        add RE(t),RE(t),%r0;    rotlwi %r0,RA(t),5;     rotlwi W(s),W(s),1;          \
        add RE(t),RE(t),%r0;    rotlwi RB(t),RB(t),30;
#define STEP0_LOAD4(t,s) \
        STEPD0_LOAD(t,s); \
        STEPD0_LOAD((t)+1,(s)+1); \
        STEPD0_LOAD((t)+2,(s)+2); \
        STEPD0_LOAD((t)+3,(s)+3);

#define STEPUP4(fn, t, s) \
        STEP##fn##_UPDATE(t,s); \
        STEP##fn##_UPDATE((t)+1,(s)+1); \
        STEP##fn##_UPDATE((t)+2,(s)+2); \
        STEP##fn##_UPDATE((t)+3,(s)+3);

#define STEPUP20(fn, t, s) \
        STEPUP4(fn, t, s); \
        STEPUP4(fn, (t)+4, (s)+4); \
        STEPUP4(fn, (t)+8, (s)+8); \
        STEPUP4(fn, (t)+12, (s)+12); \
        STEPUP4(fn, (t)+16, (s)+16)
        .globl  sha1_core
sha1_core:
        stwu    %r1,-80(%r1)
        stmw    %r13,4(%r1)

        /* Load up A - E */
        lmw     %r27,0(%r3)

        mtctr   %r5
1:
        lis     %r5,0x5a82      /* K0-19 */
        mr      RA(0),%r27
        LOADW(0)
        mr      RB(0),%r28
        LOADW(1)
        mr      RC(0),%r29
        LOADW(2)
        ori     %r5,%r5,0x7999
        mr      RD(0),%r30
        LOADW(3)
        mr      RE(0),%r31

        STEP0_LOAD4(0, 4)
        STEP0_LOAD4(4, 8)
        STEP0_LOAD4(8, 12)
        STEPUP4(D0, 12, 16)
        STEPUP4(D0, 16, 20)

        lis     %r5,0x6ed9      /* K20-39 */
        ori     %r5,%r5,0xeba1
        STEPUP20(D1, 20, 24)

        lis     %r5,0x8f1b      /* K40-59 */
        ori     %r5,%r5,0xbcdc
        STEPUP20(D2, 40, 44)

        lis     %r5,0xca62      /* K60-79 */
        ori     %r5,%r5,0xc1d6
        STEPUP4(D1, 60, 64)
        STEPUP4(D1, 64, 68)
        STEPUP4(D1, 68, 72)
        STEPUP4(D1, 72, 76)
        STEPD1(76)
        STEPD1(77)
        STEPD1(78)
        STEPD1(79)

        /* Add results to original values */
        add     %r31,%r31,RE(0)
        add     %r30,%r30,RD(0)
        add     %r29,%r29,RC(0)
        add     %r28,%r28,RB(0)
        add     %r27,%r27,RA(0)

        addi    %r4,%r4,64
        bdnz    1b

        /* Save final hash, restore registers, and return */
        stmw    %r27,0(%r3)
        lmw     %r13,4(%r1)
        addi    %r1,%r1,80
        blr
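(A tiny standalone checker - an added aside, not part of the posted file -
confirming that the f0/f1/f2 identities in the header comment match the
canonical SHA-1 choose/parity/majority functions; the '+' forms work
because their terms are disjoint:)

        #include <assert.h>
        #include <stdint.h>

        int main(void)
        {
                for (uint32_t b = 0; b < 2; b++)
                for (uint32_t c = 0; c < 2; c++)
                for (uint32_t d = 0; d < 2; d++) {
                        uint32_t f0 = ((~b & 1) & d) + (b & c); /* choose */
                        uint32_t f1 = (b ^ d) ^ c;              /* parity */
                        uint32_t f2 = (b & d) + ((b ^ d) & c);  /* majority */

                        assert(f0 == ((b & c) | ((~b & 1) & d)));
                        assert(f1 == (b ^ c ^ d));
                        assert(f2 == ((b & c) | (b & d) | (c & d)));
                }
                return 0;
        }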
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Broken PPC sha1.. (Re: Figured out how to get Mozilla into git)
2006-06-19 8:41 Broken PPC sha1.. (Re: Figured out how to get Mozilla into git) linux
@ 2006-06-19 8:50 ` Johannes Schindelin
0 siblings, 0 replies; 69+ messages in thread
From: Johannes Schindelin @ 2006-06-19 8:50 UTC (permalink / raw)
To: linux; +Cc: git, torvalds
Hi,
On Mon, 19 Jun 2006, linux@horizon.com wrote:
> By the way, if anyone's still interested, I tried to produce a
> better-scheduled PowerPC sha1 core last year. Unfortunately, I don't have
> access to a PowerPC machine to test on, so debugging is a little painful.
If you have access to SourceForge's compile farm, they have a PPC-G5
there.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 69+ messages in thread
end of thread, other threads:[~2006-06-19 8:50 UTC | newest]
Thread overview: 69+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-06-09 2:17 Figured out how to get Mozilla into git Jon Smirl
2006-06-09 2:56 ` Nicolas Pitre
2006-06-09 3:06 ` Martin Langhoff
2006-06-09 3:28 ` Jon Smirl
2006-06-09 7:17 ` Jakub Narebski
2006-06-09 15:01 ` Linus Torvalds
2006-06-09 16:11 ` Nicolas Pitre
2006-06-09 16:30 ` Linus Torvalds
2006-06-09 17:38 ` Nicolas Pitre
2006-06-09 17:49 ` Linus Torvalds
2006-06-09 17:10 ` Jakub Narebski
2006-06-09 18:13 ` Jon Smirl
2006-06-09 19:00 ` Linus Torvalds
2006-06-09 20:17 ` Jon Smirl
2006-06-09 20:40 ` Linus Torvalds
2006-06-09 20:56 ` Jon Smirl
2006-06-09 21:57 ` Linus Torvalds
2006-06-09 22:17 ` Linus Torvalds
2006-06-09 23:16 ` Greg KH
2006-06-09 23:37 ` Martin Langhoff
2006-06-09 23:43 ` Linus Torvalds
2006-06-10 0:00 ` Jon Smirl
2006-06-10 0:11 ` Linus Torvalds
2006-06-10 0:16 ` Jon Smirl
2006-06-10 0:45 ` Jon Smirl
2006-06-09 20:44 ` Jakub Narebski
2006-06-09 21:05 ` Nicolas Pitre
2006-06-09 21:46 ` Jon Smirl
2006-06-10 1:23 ` Martin Langhoff
2006-06-10 1:14 ` Martin Langhoff
2006-06-10 1:33 ` Linus Torvalds
2006-06-10 1:43 ` Linus Torvalds
2006-06-10 1:48 ` Jon Smirl
2006-06-10 1:59 ` Linus Torvalds
2006-06-10 2:21 ` Jon Smirl
2006-06-10 2:34 ` Carl Worth
2006-06-10 3:08 ` Linus Torvalds
2006-06-10 8:21 ` Jakub Narebski
2006-06-10 9:00 ` Junio C Hamano
2006-06-10 8:36 ` Rogan Dawes
2006-06-10 9:08 ` Junio C Hamano
2006-06-10 14:47 ` Rogan Dawes
2006-06-10 14:58 ` Jakub Narebski
2006-06-10 15:14 ` Nicolas Pitre
2006-06-10 17:53 ` Linus Torvalds
2006-06-10 18:02 ` Jon Smirl
2006-06-10 18:36 ` Rogan Dawes
2006-06-10 3:01 ` Linus Torvalds
2006-06-10 2:30 ` Jon Smirl
2006-06-10 3:41 ` Martin Langhoff
2006-06-10 3:55 ` Junio C Hamano
2006-06-10 4:02 ` Linus Torvalds
2006-06-10 4:11 ` Linus Torvalds
2006-06-10 6:02 ` Jon Smirl
2006-06-10 6:15 ` Junio C Hamano
2006-06-10 15:44 ` Jon Smirl
2006-06-10 16:15 ` Timo Hirvonen
2006-06-10 18:37 ` Petr Baudis
2006-06-10 18:55 ` Lars Johannsen
2006-06-11 22:00 ` Nicolas Pitre
2006-06-18 19:26 ` Linus Torvalds
2006-06-18 21:40 ` Martin Langhoff
2006-06-18 22:36 ` Linus Torvalds
2006-06-18 22:51 ` Broken PPC sha1.. (Re: Figured out how to get Mozilla into git) Linus Torvalds
2006-06-18 23:25 ` [PATCH] Fix PPC SHA1 routine for large input buffers Paul Mackerras
2006-06-19 5:02 ` Linus Torvalds
2006-06-09 3:12 ` Figured out how to get Mozilla into git Pavel Roskin
-- strict thread matches above, loose matches on Subject: below --
2006-06-19 8:41 Broken PPC sha1.. (Re: Figured out how to get Mozilla into git) linux
2006-06-19 8:50 ` Johannes Schindelin