Git development
* Re: 2.6.17-rc6-mm2
From: Goo GGooo @ 2006-06-16  5:49 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, git
In-Reply-To: <Pine.LNX.4.64.0606151937360.5498@g5.osdl.org>

On 6/16/06, Linus Torvalds <torvalds@osdl.org> wrote:

> So to recap:
>  - http is fundamentally weaker, and needs some server-side help to work
>  - rsync is fine for the initial clone, but doesn't actually know what
>    it's doing, so the end result can actually even be a corrupted
>    repository, because you happened to rsync just as it was updating.
>  - the native git protocol generally should be considered the golden
>    standard, where the other ones are just fallbacks in case of problems
>    (like firewalls that don't let git:// through, or more commonly hosted
>    servers that don't do the git protocol at all).
>
> Which hopefully clarifies the issue a bit.

Thanks for the explanation. Unfortunately I can't use git:// with "git
pull" (at least in git-1.3.2). First it does some traffic, which
suddenly stops - I guess the server starts doing *something*, perhaps
preparing the update for me or whatnot. After a pretty long while it
sends some more data, but in the meantime my ADSL router has dropped the
NAT entry and git sits on my side waiting for data forever. Recently I
tried the same on a system with a direct Internet connection and that
worked just fine.

I suggest adding SO_KEEPALIVE option on the git socket.
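
For reference, enabling keepalive on a TCP socket is a single setsockopt
call. The sketch below is in Python rather than git's C and is only an
illustration, not git's code; `make_keepalive_socket` is a made-up helper
name. Note that the default keepalive idle time on Linux is two hours, so
whether this alone beats a NAT timeout depends on tuning not shown here.

```python
import socket

def make_keepalive_socket():
    """Create a TCP socket with SO_KEEPALIVE enabled (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask the kernel to probe an idle connection instead of waiting forever.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock

sock = make_keepalive_socket()
# Read the option back to confirm the kernel accepted it.
keepalive_enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```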

Goo


* Re: Security problem
From: Linus Torvalds @ 2006-06-16  6:27 UTC (permalink / raw)
  To: Alexander Litvinov; +Cc: Junio C Hamano, git
In-Reply-To: <200606161237.21997.lan@academsoft.ru>



On Fri, 16 Jun 2006, Alexander Litvinov wrote:
>
> > Well, they may not be "safe" - you just need to work a _lot_ harder to
> > corrupt a pack-file in any interesting manner. And again, git-fsck-objects
> > would pick up any such thing going on.
>
> As shown in pack-objects.c, each object has a stored sha1, almost the same 
> as a file rename.

Yes and no.

The index file has the stored sha1 (and in that sense you can do almost 
the same thing as a file rename by just modifying the index file).

But when we actually transfer a pack over from one place to another (ie a 
clone or a push), we don't even transfer the index file. Instead, the 
index file gets re-generated at the other end.

That's pretty much an on-going theme in most of git - trying to avoid 
having metadata, if it can instead be calculated directly.

So again, a "rsync" or a "http" thing that just gets the index and 
pack-files directly _as_files_, will actually also download a corrupt 
file. The git native protocol is much harder to fool.

git-fsck-objects actually verifies the pack-files and index files in 
several ways:

 - both the pack-file and the index-file actually contain a SHA1 checksum 
   of themselves, so any accidental corruption will be picked up (but if 
   somebody is able to get at the filesystem, they can obviously 
   re-calculate the SHA1 and update the checksum too)

 - the index file also contains the SHA-1 of the pack-file (and that is 
   then part of the checksum of the index file), again to avoid accidental 
   corruption or mixing of index and pack-files.

 - fsck checks all of these internal SHA-1 checksums, and verifies basic 
   information (ie number of objects must match etc)

 - each object in the index file is unpacked, and its SHA-1 is 
   re-calculated and checked against what the index file claimed.
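
The first check in the list above can be sketched in a few lines. A
pack-file ends with a 20-byte SHA-1 of everything that precedes it; the
Python below builds an empty version-2 pack in memory and verifies that
trailer. `verify_pack_trailer` is a name made up for this sketch, not a
git function.

```python
import hashlib
import struct

def verify_pack_trailer(pack_bytes):
    """Return True if the last 20 bytes are the SHA-1 of the preceding bytes."""
    body, trailer = pack_bytes[:-20], pack_bytes[-20:]
    return hashlib.sha1(body).digest() == trailer

# An empty pack: signature, version 2, zero objects, then the checksum.
header = b"PACK" + struct.pack(">II", 2, 0)
good_pack = header + hashlib.sha1(header).digest()

# "Accidental corruption": the object count changes, the trailer does not.
corrupt_pack = b"PACK" + struct.pack(">II", 2, 1) + good_pack[12:]

good_ok = verify_pack_trailer(good_pack)
corrupt_ok = verify_pack_trailer(corrupt_pack)
```

As the list says, this only catches accidents: anyone who can rewrite the
file can recompute the trailer too.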

So exactly as with individual objects, the pack-files are actually 
verified, and on (native-mode) transfer, the names of individual files are 
never actually transferred, rather they are re-calculated from the raw 
contents at the receiving end.

The pack-files then have a few additional sanity-checks of their own that 
should help pinpoint at least the accidental kind of corruption.

But no, the SHA1 checksums of the pack-files are not checked by normal 
operations. That would be deadly - trying to check the SHA1 hash of a 
pack-file obviously would involve reading it all in, something normal 
operations actually try to avoid (normal ops use the index exactly in 
order to only read the parts they need).

Perhaps most importantly, after fsck has checked the SHA-1's of each 
individual object, it will also do a full reachability check. That, in 
many ways, is even more important than checking that each object name 
matches its contents (ie there's no missing history either, and the 
"tips" of the repository end up basically validating all the rest).

So again, the thing is set up so that doing a full fsck actually does a 
_lot_ of integrity checking.

But in the absence of explicit fsck, we do trust the data, even if the 
actual _transfer_ of data will recalculate SHA-1's.

> >  - if you corrupt the repository, subsequent clones (or even pulls) from
> >    the corrupt repository simply won't work if you use the native
> >    protocol, because the native protocol doesn't actually trust anything
> >    but the actual contents (so if the contents won't match, then neither
> >    will the SHA1 names). So the corruption is pretty strictly limited to
> >    the _one_ repository that the attacker had write access to.
>
> As I understand it, the sent pack file will contain the actual SHA-1 of the 
> objects, and any hack will be clearly visible.

No, as mentioned, the actual SHA-1's won't ever be sent, so what happens 
is that if the repository on the sending side was hacked, the _sending_ 
side may never even realize it (since it's not necessarily checking the 
SHA-1's), but the receiving side will only ever see the raw data, and as 
such, it won't ever even _see_ the "false hidden names", because it will 
generate a whole new index that purely depends on the data.

And maybe that's exactly what you meant - yes, the hack will be clearly 
visible, because the names will now be the "real" ones. You can't hide 
things by using a false name.
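
The idea that names are derived from contents and never transferred can be
illustrated as follows. A blob's name is the SHA-1 of a short header plus
the raw data; the receiver in this sketch ignores whatever name the sender
claims. `receive_blob` and the claimed name are made up for illustration
and do not correspond to actual git functions.

```python
import hashlib

def blob_name(data):
    """Compute a blob's object name from its raw contents."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()

def receive_blob(claimed_name, data):
    """Index received data under its recomputed name; the claim is ignored."""
    return blob_name(data)

empty_name = blob_name(b"")
# A sender lying about the name gains nothing: the data names itself.
stored_as = receive_blob("0" * 40, b"hello world\n")
```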

> >    So there's a pretty fundamental "corruption containment" part there.
> ...
> The situation with an evil repo is clear to me: you can trust only a trusted 
> commit identified by SHA-1

Yes. Exactly.

And once you have a reason to trust a commit, everything you can reach 
from that commit is also trustworthy, assuming it passes fsck. IOW, you 
only really need to trust the head(s) in your repository.
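
A toy version of that reachability walk, assuming a simplified object store
where each name maps to the names it references (commit to tree and parent,
tree to blobs). The store contents and `check_reachability` are invented
for this sketch; real fsck does considerably more.

```python
def check_reachability(objects, heads):
    """Return (reachable, missing) object names starting from trusted heads."""
    reachable, missing = set(), set()
    stack = list(heads)
    while stack:
        name = stack.pop()
        if name in reachable or name in missing:
            continue
        if name not in objects:
            missing.add(name)      # referenced but absent: broken history
            continue
        reachable.add(name)
        stack.extend(objects[name])
    return reachable, missing

objects = {
    "commit2": ["tree2", "commit1"],
    "commit1": ["tree1"],
    "tree2": ["blobA", "blobB"],   # blobB is deliberately not in the store
    "tree1": ["blobA"],
    "blobA": [],
    "stray": [],                   # present but not reachable from any head
}
reachable, missing = check_reachability(objects, ["commit2"])
```

Everything in `reachable` is vouched for by the head; `missing` is what a
full check would complain about.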

> > But yeah, I actually still personally do a fair number of
> > "git-fsck-objects". I've never found anything that way since very early on
> > (and back then, the real problem was rsync getting objects that weren't
> > reachable), but I still do it. It makes me feel happier.
>
> As the result: Always fsck repo after pull/clone !

Well, even better, try to avoid pulling from untrusted sources in the 
first place ;)

But yes, fsck is actually fairly fast if you do incremental pulls and 
repack your repository. To help you do this, there's two modes to fsck: 
there's the "full mode", which goes through _everything_, including 
pack-files, and there's the "fsck only loose objects", which is the common 
one.

So for example, let's say that you only ever repack your repository 
locally when it's been "known good" (in fact, repacking in itself will 
generally find almost all of the problems that fsck can find, since a full 
repack will obviously do the reachability analysis as part of just the 
preparatory work). That means that you only ever need to do the quick 
default "light fsck" after a pull, since an incremental pull (with the 
native protocol) will have unpacked all the pulled objects.

So "fsck after each pull" is not something we do by default, but if you 
keep your repo fairly packed, doing so manually (or by just scripting 
things) won't even really slow you down, because it will only ever need to 
check incrementally - the stuff you've re-packed it doesn't need to check 
(assuming you can now trust your local filesystem).

So git certainly gives you the option to be really anal, and doesn't even 
make it needlessly hard or expensive, even with large repositories.

			Linus


* Re: 2.6.17-rc6-mm2
From: Linus Torvalds @ 2006-06-16  6:39 UTC (permalink / raw)
  To: Goo GGooo; +Cc: linux-kernel, git
In-Reply-To: <ef5305790606152249n2702873fy7b708d9c47c78470@mail.gmail.com>



On Fri, 16 Jun 2006, Goo GGooo wrote:
> 
> Thanks for explanation. Unfortunately I can't use git:// with "git
> pull" (at least in git-1.3.2). First it does some traffic, that
> suddenly stops - I guess the server starts doing *something*, perhaps
> preparing the update for me or whatnot.

Yeah, for a big pull, the server will have to think about the objects it 
is going to send you.

> I suggest adding SO_KEEPALIVE option on the git socket.

Actually, the really irritating thing is that we actually generate all 
these nice status updates, which just makes pulling and cloning a lot more 
comfortable, because you actually see what is going on, and what to 
expect. 

Except they only work over ssh, where we have a separate channel (for 
stderr), and with the native git protocol all that nice status work just 
gets flushed to /dev/null :(

Dang. It's literally the most irritating part of the thing: the protocol 
itself is exactly the same whether you go over ssh:// or over git://, but 
that visual information about what is going on is missing, and it's 
surprisingly important from a usability standpoint.

And in your case, the usability downside actually turned into a real 
accessibility bug.

Oh, well.

		Linus


* Re: Autoconf/Automake
From: Nikolai Weibull @ 2006-06-16  6:51 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Linus Torvalds, Yann Dirson, Alex Riesen, Pavel Roskin, git
In-Reply-To: <Pine.LNX.4.63.0606160105100.7480@wbgn013.biozentrum.uni-wuerzburg.de>

On 6/16/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:

> In a project I am stuck in, maven is used. It tries -- of all things -- to
> fix a few shortcomings of ant -- which was supposed to fix shortcomings of
> make! And let's face it. Maven is complicated, slow as a dog lacking all
> four feet, and it still does not do the things I can do in three lines
> with make. It's a complete disaster.

But...it uses XML...how can it not be a panacea?

  nikolai


* Re: Security problem
From: Alexander Litvinov @ 2006-06-16  8:18 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, git
In-Reply-To: <Pine.LNX.4.64.0606152300460.5498@g5.osdl.org>

> So git certainly gives you the option to be really anal, and doesn't even
> make it needlessly hard or expensive, even with large repositories.

Thanks for the detailed description. Now I can sleep without any worry about my 
repo :-)


* Re: Autoconf/Automake
From: Jerome Lovy @ 2006-06-16  9:06 UTC (permalink / raw)
  To: git
In-Reply-To: <20060615174833.GA32247@dspnet.fr.eu.org>

Olivier Galibert wrote:
> On Thu, Jun 15, 2006 at 10:02:10AM -0700, Linus Torvalds wrote:
> 
>>These days, there aren't fifteen different versions of UNIX. There's a 
>>couple, and it's perfectly ok to actually say "fix your damn system and 
>>just install GNU make". It's easier to install GNU make than it is to 
>>install autoconf/automake.
> 
> 
> You should be careful to separate autoconf and automake.  Autoconf is
> not so bad, and you can make clean, maintainable Makefile.in and
> config.h.in files with it, because it uses simple substitution.  It is
> quite useful to detect available libraries when some are optional,
> and also to lightly[1] ensure that prefix and friends will stay the
> same between make and make install.  Also, especially if you hack a
> little bit to alias 'enable' and 'with', you get a sane interface to
> optional feature selection.  Oh, and to separate compilation
> directories too (vpath generation).

I fully agree with Olivier. It seems to me that you don't have to buy 
the whole autoconf/automake/libtool stack to leverage the autoconf 
functionality. autoconf alone provides the full "autoconfiguration" 
framework (running scriptlets and setting substitution variables 
accordingly). You still have to write Makefile.in (with statements 
looking like: CC=@CC@). Therefore the resulting Makefile is just as 
beautiful or as ugly as you wrote the initial Makefile.in: you have full 
control over it.

As for dependencies, one shouldn't confuse what is needed on the 
autoconfiguration developer's side (in order to build the configure 
script from the configure.in file) and what is needed on the installer's 
side to run the configure script and process the generated makefile. The 
former needs the autoconf package which itself relies on GNU m4. The 
latter merely needs a decently compatible Bourne shell and a decently 
compatible make.

On the other hand, what you get with automake is a fully automatically 
generated makefile, with make targets conforming to the GNU standards. 
But then you fully lose control over the Makefile: you don't write the 
Makefile.in anymore (automake does it for you) but rather the terse 
Makefile.am. In this respect, automake is like imake: you write few 
lines of (i)makefile, but then you cannot complain if you don't 
understand what comes in the generated makefile ;-) .

Jérôme Lovy


* Re: [BUG] stgit branch renaming into new dir crashes
From: Catalin Marinas @ 2006-06-16 12:06 UTC (permalink / raw)
  To: Yann Dirson; +Cc: GIT list
In-Reply-To: <20060613214053.GD7766@nowhere.earth>

On 13/06/06, Yann Dirson <ydirson@altern.org> wrote:
> When trying to rename a branch to a name including a slash, there is
> no explicit creation of leading dirs, and stgit crashes:
>
> $ stg branch -r multitag dev/multitag
> Traceback (most recent call last):
[...]

What version of StGIT are you using? It seems to be OK with 0.10.

-- 
Catalin


* Re: 2.6.17-rc6-mm2
From: Uwe Zeisberger @ 2006-06-16 12:40 UTC (permalink / raw)
  To: git
In-Reply-To: <ef5305790606152249n2702873fy7b708d9c47c78470@mail.gmail.com>

Hello,

> I suggest adding SO_KEEPALIVE option on the git socket.
I suggest doing this "manually", that is, sending a dummy (or status)
packet every x seconds.  Then the server could detect if a cloning
client disconnected and stop generating the pack file.

(Currently I see from time to time a git server process (IIRC
git-pack-objects) that creates a packfile and only when it's done fails
to send it.)
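
A rough sketch of that idea in Python, assuming the pack is built in a
worker thread while the main loop sends a keepalive every `interval`
seconds. All names here (`serve_with_keepalive`, the `b"keepalive"`
payload, the `cancel` event) are hypothetical; nothing in this thread says
the git protocol has such a packet.

```python
import threading
import time

def serve_with_keepalive(build_pack, send, interval):
    """Run build_pack in a thread, sending a keepalive every `interval` seconds."""
    cancel = threading.Event()
    result = {}

    def worker():
        result["pack"] = build_pack(cancel)

    thread = threading.Thread(target=worker)
    thread.start()
    try:
        while thread.is_alive():
            thread.join(interval)
            if thread.is_alive():
                send(b"keepalive")   # raises OSError if the client is gone
    except OSError:
        cancel.set()                 # tell the worker to stop generating
        thread.join()
        return False
    send(result["pack"])
    return True

# Case 1: a slow build that finishes once two keepalives have gone out.
sent = []
def slow_build(cancel):
    while len(sent) < 2 and not cancel.is_set():
        time.sleep(0.001)
    return b"PACKDATA"
completed = serve_with_keepalive(slow_build, sent.append, interval=0.01)

# Case 2: the client has disconnected, so the first keepalive fails.
def broken_send(data):
    raise ConnectionResetError("client went away")
def cancellable_build(cancel):
    cancel.wait()                    # would otherwise keep working
    return b""
aborted = not serve_with_keepalive(cancellable_build, broken_send, interval=0.01)
```

The same packets that keep a NAT entry alive also give the server an early
signal that nobody is listening any more.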

Best regards
Uwe

-- 
Uwe Zeisberger

http://www.google.com/search?q=30+hours+and+4+days+in+seconds


* Cygwin git and windows network shares
From: Niklas Frykholm @ 2006-06-16 12:58 UTC (permalink / raw)
  To: git

I'm trying to use cygwin git (compiled from the 1.4.0 tarball) to create 
a repository on a windows network share, but I get an error message.

    $ cd //computer/git/project
    $ git init-db
    defaulting to local storage area
    Could not rename the lock file?

The repository seems to be left in an inconsistent state after this:

    $ git clone //computer/git/project/
    fatal: no matching remote head
    fetch-pack from '//computer/git/project/.git' failed.

When working only with local files, I do not get these errors. Does 
anyone know the cause of this error/any way around it?

// Niklas


* Re: Cygwin git and windows network shares
From: Juergen Ruehle @ 2006-06-16 14:24 UTC (permalink / raw)
  To: Niklas Frykholm; +Cc: git
In-Reply-To: <4492AAFA.20807@grin.se>

Niklas Frykholm writes:
 > I'm trying to use cygwin git (compiled from the 1.4.0 tarball) to create 
 > a repository on a windows network share, but I get an error message.
 > 
 >     $ cd //computer/git/project
 >     $ git init-db
 >     defaulting to local storage area
 >     Could not rename the lock file?

cygwin's rename seems to be capable of overwriting an existing target
only on NTFS. The following hack is a workaround, but is probably not
safe.

diff --git a/lockfile.c b/lockfile.c
index 2346e0e..5e78211 100644
--- a/lockfile.c
+++ b/lockfile.c
@@ -48,6 +48,7 @@ int commit_lock_file(struct lock_file *l
 	strcpy(result_file, lk->filename);
 	i = strlen(result_file) - 5; /* .lock */
 	result_file[i] = 0;
+	unlink(result_file);
 	i = rename(lk->filename, result_file);
 	lk->filename[0] = 0;
 	return i;
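
The same workaround as a sketch in Python (not git's code;
`commit_lock_file` here only mirrors the C function's name). Unlike the
patch, this variant tries the plain rename first and only unlinks the
target if the platform refuses to overwrite it; the exact error raised on
a cygwin network share is an assumption. It shares the patch's caveat:
between the unlink and the second rename the target does not exist.

```python
import os
import tempfile

def commit_lock_file(lock_path):
    """Move 'name.lock' over 'name', unlinking the target if rename refuses."""
    if not lock_path.endswith(".lock"):
        raise ValueError("expected a path ending in .lock")
    result_path = lock_path[:-len(".lock")]
    try:
        os.rename(lock_path, result_path)
    except FileExistsError:
        # Non-atomic fallback for systems where rename cannot overwrite.
        os.unlink(result_path)
        os.rename(lock_path, result_path)
    return result_path

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "index")
    with open(target, "w") as f:
        f.write("old")
    with open(target + ".lock", "w") as f:
        f.write("new")
    commit_lock_file(target + ".lock")
    with open(target) as f:
        committed = f.read()
    lock_gone = not os.path.exists(target + ".lock")
```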


* Why so much time in the kernel?
From: Jon Smirl @ 2006-06-16 14:49 UTC (permalink / raw)
  To: git

I'm still working on importing Mozilla CVS. I'm at the phase now where
all of the changesets have been identified. The scripts are pulling the
changesets one at a time out of CVS and putting them into git. I've
been running this phase for 2 days now on a 3GB machine and it still
isn't finished.

I am spending over 40% of the time in the kernel. This looks to be
caused from forks and starting small tasks, is that the correct
interpretation? Is the number of processes that have been run recorded
anywhere? 1.4% of the time is spent in the dynamic linker.

Checking with oprofile I see this:

  18262372 41.0441 /home/good/vmlinux
  5465741 12.2841 /usr/bin/cvs
  4374336  9.8312 /lib/libc-2.4.so
  3627709  8.1532 /lib/libcrypto.so.0.9.8a
  2494610  5.6066 /usr/bin/oprofiled
  2471238  5.5540 /usr/lib/libz.so.1.2.3
   945349  2.1246 /usr/lib/perl5/5.8.8/i386-linux-thread-multi/CORE/libperl.so
   933646  2.0983 /usr/local/bin/git-read-tree
   758776  1.7053 /usr/local/bin/git-write-tree
   642502  1.4440 /lib/ld-2.4.so
   472903  1.0628 /nvidia
   379254  0.8524 /usr/local/bin/git-pack-objects

and breaking down the kernel number:

3467889  18.9893  copy_page_range
2190416  11.9941  unmap_vmas
1156011   6.3300  page_fault
887794    4.8613  release_pages
860853    4.7138  page_remove_rmap
633243    3.4675  get_page_from_freelist
398773    2.1836  do_wp_page
344422    1.8860  __mutex_lock_slowpath
280070    1.5336  __handle_mm_fault
241713    1.3236  do_page_fault
238398    1.3054  __d_lookup
236654    1.2959  vm_normal_page


-- 
Jon Smirl
jonsmirl@gmail.com


* Re: Why so much time in the kernel?
From: Linus Torvalds @ 2006-06-16 15:06 UTC (permalink / raw)
  To: Jon Smirl; +Cc: git
In-Reply-To: <9e4733910606160749t4d7a541ev72a67383e96d86da@mail.gmail.com>



On Fri, 16 Jun 2006, Jon Smirl wrote:
> 
> I am spending over 40% of the time in the kernel. This looks to be
> caused from forks and starting small tasks, is that the correct
> interpretation?

Yes. Your kernel profile is all for stuff related to setting up and 
tearing down process space (well, __mutex_lock_slowpath at 1.88% and 
__d_lookup at 1.3% is not, but every single one before that does seem to 
be about fork/exec/exit).

I think it's both the CVS server that continually forks/exits (it doesn't 
actually do an exec at all - it seems to be using fork/exit as a way to 
control its memory usage - knowing that the OS will free all the temporary 
memory on exit - I think the newer CVS development trees don't do this, 
but that also seems to be why they leak memory like mad and eventually run 
out ;).

AND it's git-cvsimport forking and exec'ing git helper processes. 

So that process overhead is expected.

What I would _not_ have expected is:

>   933646  2.0983 /usr/local/bin/git-read-tree

I don't see why git-read-tree is so hot for you. We should never need to 
read a tree when we're importing something, unless there are tons of 
branches and we switch back and forth between them.

I guess mozilla really does use a fair number of branches? 

Martin sent out a patch (that I don't think has been merged yet) to avoid 
the git-read-tree overhead when switching branches. Look for an email with 
a subject like "cvsimport: keep one index per branch during import", I 
suspect that would speed up the git part a lot.

(It will also avoid a few fork/exec's, but you'll still have most of them, 
so I don't think you'll see any really _fundamental_ changes to this, but 
the git-read-tree overhead should be basically gone, and some of the 
libz.so pressure would also be gone with it. It should also avoid 
rewriting the index file, so you'd get lower disk pressure, but it looks 
like none of your problems are really due to IO, so again, that probably 
won't make much of a difference for you).

			Linus


* Re: Why so much time in the kernel?
From: Jon Smirl @ 2006-06-16 15:25 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git
In-Reply-To: <Pine.LNX.4.64.0606160755170.5498@g5.osdl.org>

On 6/16/06, Linus Torvalds <torvalds@osdl.org> wrote:
>
>
> On Fri, 16 Jun 2006, Jon Smirl wrote:
> >
> > I am spending over 40% of the time in the kernel. This looks to be
> > caused from forks and starting small tasks, is that the correct
> > interpretation?
>
> Yes. Your kernel profile is all for stuff related to setting up and
> tearing down process space (well, __mutex_lock_slowpath at 1.88% and
> __d_lookup at 1.3% is not, but every single one before that does seem to
> be about fork/exec/exit).
>
> I think it's both the CVS server that continually forks/exits (it doesn't
> actually do an exec at all - it seems to be using fork/exit as a way to
> control its memory usage - knowing that the OS will free all the temporary
> memory on exit - I think the newer CVS development trees don't do this,
> but that also seems to be why they leak memory like mad and eventually run
> out ;).

I am using cvs-1.11.21-3.2
I can try running their development tree.

>
> AND it's git-cvsimport forking and exec'ing git helper processes.

Is it worthwhile to make a library version of these? Svn has lib
versions and they barely show up in oprofile. cvsimport is only using
4-5 low level git functions.

>
> So that process overhead is expected.
>
> What I would _not_ have expected is:
>
> >   933646  2.0983 /usr/local/bin/git-read-tree
>
> I don't see why git-read-tree is so hot for you. We should never need to
> read a tree when we're importing something, unless there are tons of
> branches and we switch back and forth between them.
>
> I guess mozilla really does use a fair number of branches?

Is 1,800 a lot?

>
> Martin sent out a patch (that I don't think has been merged yet) to avoid
> the git-read-tree overhead when switching branches. Look for an email with
> a subject like "cvsimport: keep one index per branch during import", I
> suspect that would speed up the git part a lot.

I'll check this out

> (It will also avoid a few fork/exec's, but you'll still have most of them,
> so I don't think you'll see any really _fundamental_ changes to this, but
> the git-read-tree overhead should be basically gone, and some of the
> libz.so pressure would also be gone with it. It should also avoid
> rewriting the index file, so you'd get lower disk pressure, but it looks
> like none of your problems are really due to IO, so again, that probably
> won't make much of a difference for you).

I have been CPU bound for two days, disk activity is minor.
git-cvsimport is 250MB and I have 2GB of disk cache.

After looking at this process for about a week it doesn't look like
processing chronologically is the best strategy. cvsps can quickly
work out the changesets, 15 minutes. Then it might be better to walk
the CVS files one at a time generating git IDs for each revision. Next
use the IDs and changeset info to build the git trees. Finally pack
everything. This strategy would minimize the work load on the CVS
files (adding all those deltas to get random revs).

Can git build a repository in this manner? If this is feasible it may
be possible to do all of this in a single pass over the CVS tree by
modifying cvsps.

-- 
Jon Smirl
jonsmirl@gmail.com


* Re: Why so much time in the kernel?
From: Linus Torvalds @ 2006-06-16 16:09 UTC (permalink / raw)
  To: Jon Smirl; +Cc: git
In-Reply-To: <9e4733910606160825hb538d6fo4c9f1d7d9768e100@mail.gmail.com>



On Fri, 16 Jun 2006, Jon Smirl wrote:
>
> I am using cvs-1.11.21-3.2
> I can try running their development tree.

No, don't. We already know that 1.12 leaks memory and makes the cvsimport 
not work at all.

> > 
> > AND it's git-cvsimport forking and exec'ing git helper processes.
> 
> Is it worthwhile to make a library version of these? Svn has lib
> versions and they barely show up in oprofile. cvsimport is only using
> 4-5 low level git functions.

Eventually, I think that's where we'll get. We're already at the stage 
where most of the core could just be written as a library.

> > I guess mozilla really does use a fair number of branches?
> 
> Is 1,800 a lot?

Yeah. Although even just two is enough, if you just alternate committing 
on them ;)

So it's actually not number of branches, it's more about frequency of 
the branch changing in the cvsps output. And yes, you could probably 
improve performance by sorting the changesets differently, but Martin's 
change to use separate index files should make it all pretty moot.
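
To make the "frequency of the branch changing" point concrete, the sketch
below counts how often consecutive changesets land on different branches.
The changeset lists are made-up data, and grouping by branch as shown
ignores ordering constraints between branches that a real importer would
have to respect.

```python
def count_branch_switches(branches):
    """Count adjacent changesets that are on different branches."""
    return sum(1 for prev, cur in zip(branches, branches[1:]) if prev != cur)

# Two branches, committed to alternately: the worst case described above.
alternating = ["HEAD", "stable"] * 4
# The same changesets grouped by branch (sorted() is stable, so the order
# within each branch is preserved).
grouped = sorted(alternating)

switches_alternating = count_branch_switches(alternating)
switches_grouped = count_branch_switches(grouped)
```

Each switch is a point where a single shared index would have to be
re-read for the other branch, so the first ordering pays that cost seven
times and the second only once.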

		Linus


* Re: Why so much time in the kernel?
From: Jon Smirl @ 2006-06-16 17:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git
In-Reply-To: <Pine.LNX.4.64.0606160906250.5498@g5.osdl.org>

Is it a crazy idea to read the cvs files, compute an sha1 on each
expanded delta and then write the delta straight into a pack file? Are
the cvs and git delta formats the same? What about CVS's forward and
reverse delta use? While this is going on, track the
branches/changesets in memory and then finish up by writing these trees
into the pack file too. This should take no more ram than cvsps needs
currently.

This leaves the packfile in a non-optimal format but a repack should
fix that, right?

-- 
Jon Smirl
jonsmirl@gmail.com


* Re: Why so much time in the kernel?
From: Jakub Narebski @ 2006-06-16 17:09 UTC (permalink / raw)
  To: git
In-Reply-To: <9e4733910606161000t53328571u10a350eca894ccdc@mail.gmail.com>

Jon Smirl wrote:

> Is it a crazy idea to read the cvs files, compute an sha1 on each
> expanded delta and then write the delta straight into a pack file?

That's what parsecvs does (i.e. read *,v files directly).
See http://git.or.cz/gitwiki/InterfacesFrontendsAndTools

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git


* git-rebase nukes multiline comments
From: Matthias Hopf @ 2006-06-16 17:12 UTC (permalink / raw)
  To: git; +Cc: xorg

Hi all,

I'm using git-1.2.4 on SL10.1, in centralized style development (for X.org).

I wanted to commit a set of changes (4 local commits) upstream, so I had
to do a git-rebase first (in that particular case a git-pull would have
been possible as well, but git-rebase fits the CVS style development
better). After git-fetch, git-rebase origin, and git-push all my changes
had only the first line of the changelog comment, the remainder was
nuked.

To reproduce:

mkdir /var/tmp/blaup
cd /var/tmp/blaup
git-init-db
echo test > foo
git-add foo
git-commit      (any comment)
cd ..
git-clone /var/tmp/blaup bla
cd bla
echo test2 >>foo 
git-commit foo  (multiline comment)
cd ../blaup
echo test3 >bar
git-add bar
git-commit      (any comment)
cd ../bla
git-fetch
git-log         (shows multiline comment for 'test2')
git-rebase origin
git-log         (shows only the first line of the multiline comment!)


I doubt this is intended behavior.


Also, while trying to reproduce this with the original upstream
repository, I would have had to git-fetch my origin branch (upstream
master), but not to get _all_ new commits, but only up to a certain
revspec (the one *before* my own commits).

I tried "git-fetch <refspec>:", but this didn't work, neither did
anything else I tried.  This is clearly beyond my understanding of git,
so how can this be done?

Thanks

Matthias

-- 
Matthias Hopf <mhopf@suse.de>       __        __   __
Maxfeldstr. 5 / 90409 Nuernberg    (_   | |  (_   |__         mat@mshopf.de
Phone +49-911-74053-715            __)  |_|  __)  |__  labs   www.mshopf.de


* Re: git-rebase nukes multiline comments
From: David Kowis @ 2006-06-16 17:23 UTC (permalink / raw)
  To: git, xorg
In-Reply-To: <20060616171251.GA29820@suse.de>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

Matthias Hopf wrote:
> Hi all,
> 
> I'm using git-1.2.4 on SL10.1, in centralized style development (for X.org).
> 
> I wanted to commit a set of changes (4 local commits) upstream, so I had
> to do a git-rebase first (in that particular case a git-pull would have
> been possible as well, but git-rebase fits the CVS style development
> better). After git-fetch, git-rebase origin, and git-push all my changes
> had only the first line of the changelog comment, the remainder was
> nuked.
> 
> To reproduce:
> 
> mkdir /var/tmp/blaup
> cd /var/tmp/blaup
> git-init-db
> echo test > foo
> git-add foo
> git-commit      (any comment)
> cd ..
> git-clone /var/tmp/blaup bla
> cd bla
> echo test2 >>foo 
> git-commit foo  (multiline comment)
> cd ../blaup
> echo test3 >bar
> git-add bar
> git-commit      (any comment)
> cd ../bla
> git-fetch
> git-log         (shows multiline comment for 'test2')
> git-rebase origin
> git-log         (shows only the first line of the multiline comment!)
> 
> 

I'm new to git, but I tried what you said.
my git log:
commit c846bea8c61bec7cf0f7688c48abc42577b9ac7f
Author: David Kowis <dkowis@kain.org>
Date:   Fri Jun 16 12:20:08 2006 -0500

    this is a multi

    line comment
    with three lines


I'm using git 1.4.0. It added a blank line in there...


David Kowis

ISO Team Lead - www.sourcemage.org
Source Mage GNU/Linux

Progress isn't made by early risers. It's made by lazy men trying to
find easier ways to do something.
  - Robert Heinlein
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (MingW32)

iQGVAwUBRJLo+cnf+vRw63ObAQomewv+L18ogJHgx3jQPt/B+K84GIAX5SugrSnZ
ASC2jm/sbMdidU1goOepXILw2DBOWKSpuDwTZXE0uDrldMTK4RW/2dDACbGVEQX/
Ter4cclIxNztaAwzXGHqKyOI24c5jQmlzW+yDcnErJZTexDA6xyp4xVZlySJpZev
tzfj1Di/uYNJ83lcgS9ID64JToZ5sYZjeqy5HjfEpEQR7xHSYoaR94LNjSHMrqU8
S32ryCMeBSX9SWP8lX7lv6YzIlPGYbOVIsskANVN4GyYVdoMXyXpNtDvziIXrxJj
FkSCloMq5bzVuykthPer0FQRXiySyM1bWsUt9i7Xf3fF8qzyVpIJghP3GAlwh4Gs
LRefaUkkVH61FmN+Uw65xxdx99L4ABoZJDpPBhQdOnY+BXbhNGM5p/lAi3iX72Bx
eIMmaWiwxF8XlIaLJFbDVtGA7lwJzneQQUyHHlTZhzu+VXf4ulKPE93NKEuWWqnL
FD9Tgmu5sFANq5iKSCyocvyAqiWljR8w
=hQWx
-----END PGP SIGNATURE-----


* Re: Why so much time in the kernel?
From: Keith Packard @ 2006-06-16 17:29 UTC (permalink / raw)
  To: Jon Smirl; +Cc: keithp, Linus Torvalds, git
In-Reply-To: <9e4733910606161000t53328571u10a350eca894ccdc@mail.gmail.com>

On Fri, 2006-06-16 at 13:00 -0400, Jon Smirl wrote:
> Is it a crazy idea to read the cvs files, compute an sha1 on each
> expanded delta and then write the delta straight into a pack file? Are
> the cvs and git delta formats the same? What about CVS's forward and
> reverse delta use?

At this point, merging blobs into packs isn't a significant part of the
computational cost. parsecvs is spending all of its time in the
quadratic traversal of the diff chains; fixing that to emit all of the
versions in a single pass should speed up that part of the conversion
process dramatically.

>  While this is going on, track the
> branches/changsets in memory and then finish up by writing these trees
> into the pack file too. This should take no more ram than cvsps needs
> currently.

cvsps drops too much state on the floor making branch point and branch
contents inaccurate. What I'm hoping is that I can figure out a way to
discard most of the per-version information by computing tree objects in
reverse order, saving only the tree sha1 and other per-commit info, then
stitch the commits together using that, without needing the full
per-file data.

-- 
keith.packard@intel.com

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply

* Re: Why so much time in the kernel?
From: Jon Smirl @ 2006-06-16 17:44 UTC (permalink / raw)
  To: Keith Packard; +Cc: Linus Torvalds, git
In-Reply-To: <1150478968.6983.7.camel@neko.keithp.com>

[-- Attachment #1: Type: text/plain, Size: 2050 bytes --]

On 6/16/06, Keith Packard <keithp@keithp.com> wrote:
> On Fri, 2006-06-16 at 13:00 -0400, Jon Smirl wrote:
> > Is it a crazy idea to read the cvs files, compute an sha1 on each
> > expanded delta and then write the delta straight into a pack file? Are
> > the cvs and git delta formats the same? What about CVS's forward and
> > reverse delta use?
>
> At this point, merging blobs into packs isn't a significant part of the
> computational cost. parsecvs is spending all of its time in the
> quadratic traversal of the diff chains; fixing that to emit all of the
> versions in a single pass should speed up that part of the conversion
> process dramatically.

That's not true for the state I am in. cvsps can compute the changeset
tree in 15 minutes, cvs2svn can compute their version in a couple of
hours. cvs2svn builds a much better tree.

I've been extracting versions from cvs and adding them to git now for
2.5 days and the process still isn't finished. It is completely CPU
bound. It's just a loop of cvs co, add it to git, make tree, commit,
etc.

> >  While this is going on, track the
> > branches/changsets in memory and then finish up by writing these trees
> > into the pack file too. This should take no more ram than cvsps needs
> > currently.
>
> cvsps drops too much state on the floor making branch point and branch
> contents inaccurate. What I'm hoping is that I can figure out a way to
> discard most of the per-version information by computing tree objects in
> reverse order, saving only the tree sha1 and other per-commit info, then
> stitch the commits together using that, without needing the full
> per-file data.

I agree cvsps is dropping a lot.  My screen is full of "Skipping
#CVSPS_NO_BRANCH" and
"Skipping SpiderMonkey140_NES40Rtm_Branch" and "Skipping
SpiderMonkey140_BRANCH" etc.

What about the cvs2svn algorithm described in the attachment? A RAM-
based version could be faster. Compression could be achieved by
switching from using the full path to a version to the sha1 for it.

-- 
Jon Smirl
jonsmirl@gmail.com

[-- Attachment #2: design-notes.txt --]
[-- Type: text/plain, Size: 22780 bytes --]

                         How cvs2svn Works
                         =================

A cvs2svn run consists of eight passes.  Each pass saves the data it
produces to files on disk, so that a) we don't hold huge amounts of
state in memory, and b) the conversion process is resumable.

Pass 1:
=======

The goal of this pass is to write to 'cvs2svn-data.revs' a summary of
all the revisions for each RCS file.  Each revision will be
represented by one line.  At the end of this stage, the revisions
(i.e., the lines) will be grouped by RCS file, not by logical commits.

We walk over the repository, processing each RCS file with
rcsparse.parse(), using cvs2svn's CollectData class, which is a
subclass of rcsparse.Sink(), the parser's callback class.  For each
RCS file, the first thing the parser encounters is the administrative
header, including the head revision, the principal branch, symbolic
names, RCS comments, etc.  The main thing that happens here is that
CollectData.define_tag() is invoked on each symbolic name and its
attached revision, so all the tags and branches of this file get
collected.

Next, the parser hits the revision summary section.  That's the part
of the RCS file that looks like this:

   1.6
   date	2002.06.12.04.54.12;	author captnmark;	state Exp;
   branches
   	1.6.2.1;
   next	1.5;

   1.5
   date	2002.05.28.18.02.11;	author captnmark;	state Exp;
   branches;
   next	1.4;

   [...]

For each revision summary, CollectData.define_revision() is invoked,
recording that revision's metadata in various variables of the
CollectData class instance.

After finishing the revision summaries, the parser invokes
CollectData.tree_completed(), which loops over the revision
information stored, determining if there are instances where a higher
revision was committed "before" a lower one (rare, but it can happen
when there was clock skew on the repository machine).  If there are
any, it "resyncs" the timestamp of the earlier rev to be just before
that of the later rev, but saves the original timestamp in
self.rev_data[blah][2], so we can later write out a record to the
resync file indicating that an adjustment was made (this makes it
possible to catch the other parts of this commit and resync them
similarly, more details below).
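
The clock-skew adjustment described above can be sketched roughly as
follows.  This is only an illustration, not the actual
CollectData.tree_completed() code: the function name, the tuple layout,
and the "one second before" rule are assumptions made for the example.

```python
def resync_timestamps(revisions):
    """Return (adjusted, resync_records) for one RCS file.

    revisions: list of (rev_number, timestamp) tuples, lowest revision
    first.  The layout is hypothetical, chosen for this sketch.
    """
    adjusted = list(revisions)
    resync_records = []
    # Walk from the newest revision back to the oldest, so that an
    # adjustment made to one revision is seen by the one before it.
    for i in range(len(adjusted) - 1, 0, -1):
        _, later_ts = adjusted[i]
        earlier_rev, earlier_ts = adjusted[i - 1]
        if earlier_ts >= later_ts:
            # The lower revision claims to be newer than the higher one:
            # pull it back to just before the higher revision, and keep
            # the original timestamp for the resync file.
            new_ts = later_ts - 1
            adjusted[i - 1] = (earlier_rev, new_ts)
            resync_records.append((earlier_rev, earlier_ts, new_ts))
    return adjusted, resync_records


# Revision 1.1 carries a later timestamp than 1.2, so it gets resynced.
adjusted, records = resync_timestamps([("1.1", 100), ("1.2", 90), ("1.3", 120)])
```

Each entry in resync_records carries the old and new timestamps, which
is the information the resync file needs.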

Next, the parser encounters the *real* revision data, which has the
log messages and file contents.  For each revision, it invokes
CollectData.set_revision_info(), which writes a new line to
cvs2svn-data.revs.  The line is constructed by the CVSRevision class -
one of its many roles. Here is an example:

   3dc32955 5afe9b4ba41843d8eb52ae7db47a43eaa9573254 3dc32954 3dc32956 C 1.1 1.2 1.3 1 1 1024 N * 0 0 foo/bar,v

The fields are:

   1.  a fixed-width timestamp
   2.  a digest of the log message + author
   3.  a fixed-width timestamp indicating the timestamp of this
       revision's previous revision (or "*", if it's the first
       revision on this line of development).
   4.  a fixed-width timestamp indicating the timestamp of this
       revision's next revision (or "*", if it's the last revision on
       this line of development).
   5.  the type of change ("A"dd, "C"hange, or "D"elete)
   6.  the revision number of the previous revision along this line of
       development (or "*", if it's the first revision on this line of
       development).
   7.  the revision number
   8.  the revision number of the next revision along this line of
       development (or "*", if it's the last revision on this line of
       development).
   9.  1 if the RCS file is in the Attic, "*" if it isn't.
   10. 1 if the RCS file has the executable bit set, "*" if not.
   11. The size of the RCS file, in bytes.
   12. "N" if this revision has non-empty deltatext, else "E" for empty
   13. the RCS keyword substitution mode ("k", "b", etc), or "*" if none
   14. the branch on which this commit happened, or "*" if not on a branch
   15. the number of tags rooted at this revision (followed by their
       names, space-delimited)
   16. the number of branches rooted at this revision (followed by
       their names, space-delimited)
   17. the path of the RCS file in the repository

(Of course, in the above example, fields 15 and 16 are "0", so they have
no additional data.)
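
A line in this format could be split back into fields along the
following lines.  This is a sketch based only on the field list above,
not the CVSRevision parser itself; the dictionary keys are invented
names, and reading the timestamps as hexadecimal is an assumption.  Note
that the sample line shown earlier has one token fewer than the list
implies, so the example below uses a made-up line that carries all 17
fields.

```python
# Names for the 14 fixed leading fields (hypothetical; chosen to mirror
# the numbered list in the notes).
FIXED_FIELDS = [
    "timestamp", "digest", "prev_timestamp", "next_timestamp", "op",
    "prev_rev", "rev", "next_rev", "in_attic", "executable", "file_size",
    "deltatext", "mode", "branch",
]


def parse_revs_line(line):
    """Split one revs line into a dict, following the field list above."""
    tokens = line.rstrip("\n").split(" ")
    record = dict(zip(FIXED_FIELDS, tokens[:14]))
    rest = tokens[14:]
    # Field 15: tag count, followed by that many tag names.
    ntags = int(rest[0])
    record["tags"] = rest[1:1 + ntags]
    rest = rest[1 + ntags:]
    # Field 16: branch count, followed by that many branch names.
    nbranches = int(rest[0])
    record["branches"] = rest[1:1 + nbranches]
    rest = rest[1 + nbranches:]
    # Field 17: everything left is the path, which may contain spaces.
    record["path"] = " ".join(rest)
    # Assumption: the fixed-width timestamp is hexadecimal seconds.
    record["timestamp"] = int(record["timestamp"], 16)
    return record


sample_line = (
    "3dc32955 5afe9b4ba41843d8eb52ae7db47a43eaa9573254 3dc32954 * C "
    "1.1 1.2 * * 1 1024 N b * 2 TAG_A TAG_B 1 BRANCH_X foo/my file,v"
)
parsed = parse_revs_line(sample_line)
```

Because the tag and branch counts come before their names, the path can
safely be taken as "whatever is left", even if it contains spaces.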

Also, for resync'd revisions, a line like this is written out to
'cvs2svn-data.resync':

   3d6c1329 18a215a05abea1c6c155dcc7283b88ae7ce23502 3d6c1328

The fields are:

   NEW_TIMESTAMP   DIGEST   OLD_TIMESTAMP

(The resync file will be explained later.)

That's it -- the RCS file is done.

When every RCS file is done, Pass 1 is complete, and:

   - cvs2svn-data.revs contains a summary of every RCS file's
     revisions.  All the revisions for a given RCS file are grouped
     together, but note that the groups are in no particular order.
     In other words, you can't yet identify the commits from looking
     at these lines; a multi-file commit will be scattered all over
     the place.

   - cvs2svn-data.resync contains a small amount of resync data, in
     no particular order.

Pass 2:
=======

This is where the resync file is used.  The goal of this pass is to
convert cvs2svn-data.revs to a new file, 'cvs2svn-data.c-revs' (clean
revs).  It's the same as the original file, except for some resync'd
timestamps.

First, read the whole resync file into a hash table that maps each
author+log digest to a list of lists.  Each sublist represents one of
the timestamp adjustments from Pass 1, and looks like this:

   [old_time_lower, old_time_upper, new_time]

The reason to map each digest to a list of sublists, instead of to one
list, is that sometimes you'll get the same digest for unrelated
commits (for example, the same author commits many times using the
empty log message, or a log message that just says "Doc tweaks.").  So
each digest may need to "fan out" to cover multiple commits, but
without accidentally unifying those commits.

Now we loop over cvs2svn-data.revs, writing each line out to
'cvs2svn-data.c-revs'.  Most lines are written out unchanged, but
those whose digest matches some resync entry, and appear to be part of
the same commit as one of the sublists in that entry, get tweaked.
The tweak is to adjust the commit time of the line to the new_time,
which is taken from the resync hash and results from the adjustment
described in Pass 1.

The way we figure out whether a given line needs to be tweaked is to
loop over all the sublists, seeing if this commit's original time
falls within the old<-->new time range for the current sublist.  If it
does, we tweak the line before writing it out, and then conditionally
adjust the sublist's range to account for the timestamp we just
adjusted (since it could be an outlier).  Note that this could, in
theory, result in separate commits being accidentally unified, since
we might gradually adjust the two sides of the range such that they are
eventually more than COMMIT_THRESHOLD seconds apart.  However, this is
really a case of CVS not recording enough information to disambiguate
the commits; we'd know we have a time range that exceeds the
COMMIT_THRESHOLD, but we wouldn't necessarily know where to divide it
up.  We could try some clever heuristic, but for now it's not
important -- after all, we're talking about commits that weren't
important enough to have a distinctive log message anyway, so does it
really matter if a couple of them accidentally get unified?  Probably
not.
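
The matching loop just described might look roughly like this.  It is a
sketch under assumptions, not the cvs2svn source: the function names,
the 300-second COMMIT_THRESHOLD, and the exact way the window is widened
are all chosen for illustration.

```python
COMMIT_THRESHOLD = 300  # seconds; an assumed value for this sketch


def load_resync(records):
    """Build digest -> list of [old_time_lower, old_time_upper, new_time].

    records: iterable of (new_timestamp, digest, old_timestamp), i.e. the
    three columns of the resync file.
    """
    table = {}
    for new_time, digest, old_time in records:
        window = [old_time - COMMIT_THRESHOLD // 2,
                  old_time + COMMIT_THRESHOLD // 2,
                  new_time]
        table.setdefault(digest, []).append(window)
    return table


def clean_timestamp(timestamp, digest, table):
    """Return the timestamp to write to the c-revs file for one line."""
    for window in table.get(digest, []):
        lower, upper, new_time = window
        if lower <= timestamp <= upper:
            # Same digest and close in time: treat it as part of the
            # resync'd commit, and widen the window around this line in
            # case it was an outlier.
            window[0] = min(lower, timestamp - COMMIT_THRESHOLD // 2)
            window[1] = max(upper, timestamp + COMMIT_THRESHOLD // 2)
            return new_time
    return timestamp  # no resync entry applies; leave the line unchanged


table = load_resync([(1000, "abc", 1010)])
first = clean_timestamp(1100, "abc", table)    # inside the window
second = clean_timestamp(1240, "abc", table)   # inside the widened window
far_away = clean_timestamp(5000, "abc", table)
other = clean_timestamp(1010, "zzz", table)
```

The second call only matches because the first one widened the window,
which is exactly the gradual drift the paragraph above warns about.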

NOTE: We currently have a fairly major bug in our resync code.  The
resync_bug test demonstrates it.  The bug is that, when resyncing in
pass 2, we take no care not to move cvs revisions before previous
cvs revisions of the same file, thus creating the very problem we were
attempting to avoid.

Pass 3:
=======

This is where we deduce the changesets, that is, the grouping of file
changes into single commits.

It's very simple -- run 'sort' on cvs2svn-data.c-revs, converting it
to 'cvs2svn-data.s-revs'.  Because of the way the data is laid out,
this causes commits with the same digest (that is, the same author and
log message) to be grouped together.  Poof!  We now have the CVS
changes grouped by logical commit.

In some cases, the changes in a given commit may be interleaved with
other commits that went on at the same time, because the sort gives
precedence to date before log digest.  However, Pass 4 detects this by
seeing that the log digest is different, and reseparates the commits.
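
As a rough illustration of Passes 3 and 4 together, the sort and the
later re-separation by digest could be sketched as below.  The function
name, the 300-second threshold, and the shortened sample lines are
assumptions for the example; the real passes work on files, not lists.

```python
COMMIT_THRESHOLD = 300  # seconds; an assumed value for this sketch


def group_commits(lines):
    """Sort revs lines, then split them into per-commit groups."""
    open_groups = {}  # digest -> the group currently collecting lines
    commits = []
    # Lines start with a fixed-width hex timestamp and then the digest,
    # so a plain lexical sort orders them by time, then by digest.
    for line in sorted(lines):
        ts_hex, digest = line.split(" ", 2)[:2]
        ts = int(ts_hex, 16)
        group = open_groups.get(digest)
        if group is None or ts - group["last"] > COMMIT_THRESHOLD:
            # First line with this digest, or too long since the last
            # one: start a new commit.
            group = {"last": ts, "lines": []}
            open_groups[digest] = group
            commits.append(group["lines"])
        group["last"] = ts
        group["lines"].append(line)
    return commits


lines = [
    "0000000c aaa f2",
    "0000000a aaa f1",
    "00001000 aaa f3",
    "0000000b bbb g1",
]
commits = group_commits(lines)
```

Here the "bbb" line sorts between the two early "aaa" lines, yet still
ends up in its own commit, and the much later "aaa" line starts a new
one.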

Pass 4:
=======

This pass has two primary objectives:

1. Create a database that maps CVSRevision unique keys to the actual
   CVSRevision string from the revs file (whose format is described
   above in pass 1).  This results in a database containing one
   key-value pair for each line in the revs file.  This gives us the
   ability to pass around these smaller keys instead of whole CVS
   revisions (which look like lines from the s-revs file).  See the
   CVSRevision class for more details on what the unique key is.

2. Find and create a database containing the last CVS revision that is
   a source (also referred to as an "opening" revision) for all
   symbolic names.  This will result in a database containing
   key-value pairs whose key is the unique key for a CVSRevision, and
   whose value is a list of symbolic names for which that CVSRevision
   is the last "opening."

   The format for this file is:

       cvs-symname-last-revs.db:
            Key                      Value
            CVS Revision             array of Symbolic names

       For example:

            1.38/foo/bar/baz.txt,v  --> [TAG11, BRANCH38]
            1.93/foo/qux/bat.c,v    --> [TAG39]
            1.4/foo/bar/baz.txt,v   --> [BRANCH48, BRANCH37]
            1.18/foo/bar/quux.txt,v --> [TAG320, TAG1178]

Pass 5:
=======

Primarily, this pass gathers CVS revisions into Subversion revisions
(a Subversion revision is comprised of one or more CVS revisions)
before we actually begin committing (where "committing" means either
to a Subversion repository or to a dump file).

This pass does the following:

1. Creates a database file to map Subversion Revision numbers to their
   corresponding CVS Revisions (cvs2svn-svn-revnums-to-cvs-revs.db).
   Creates another database file to map CVS Revisions to their
   Subversion Revision numbers (cvs2svn-cvs-revs-to-svn-revnums.db).

2. When a file is copied to a symbolic name in cvs2svn, there is a
   range of valid Subversion revisions that we can copy the file from.
   The first valid Subversion revision number for a symbolic name is
   called the "Opening", and the first *invalid* Subversion revision
   number encountered after the "Opening" is called the "Closing".  In
   this pass, the SymbolingsLogger class writes one line to
   cvs2svn-symbolic-names.txt per CVS file, per symbolic name, per
   opening or closing.

3. For each CVS Revision in s-revs, we write out a line (for each
   symbolic name that it opens) to a symbolic-names.txt file if it is
   the first possible source revision (the "opening" revision) for a
   copy to create a branch or tag, or if it is the last possible
   revision (the "closing" revision) for a copy to create a branch or
   tag.  Not every opening will have a corresponding closing.

   The format of each line is:

       SYMBOLIC_NAME SVN_REVNUM TYPE CVSRevision.unique_key()

   For example:

       MY_TAG1 234 O 1.3/foo/bar/baz.txt,v
       MY_BRANCH3 245 O 1.13/foo/qux/bat.c,v
       MY_TAG1 241 C 1.4/foo/bar/baz.txt,v
       MY_BRANCH_BLAH 201 O 1.1/foo/bar/quux.txt,v

   Here is what the columns mean:

   SYMBOLIC_NAME: The name of the branch or tag that starts or ends
                  in this CVS Revision (There can be multiples per
                  CVS rev)

   SVN_REVNUM: The Subversion revision number that is the opening or
               closing for this SYMBOLIC_NAME.

   TYPE: "O" for Openings and "C" for Closings.

   CVSRevision.unique_key(): This is a unique key that identifies
                             the CVSRevision where this opening or
                             closing happened.

   See SymbolingsLogger for more details.
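
To make the opening/closing idea concrete, here is a small sketch that
reads lines in the format above and collects, per symbolic name and per
file, the range of Subversion revisions a copy could come from.  The
function name is made up, and splitting the unique key at the first "/"
into revision and path is an assumption inferred from the examples.

```python
from collections import defaultdict


def copy_ranges(lines):
    """Map symbolic name -> path -> [opening_revnum, closing_revnum]."""
    ranges = defaultdict(dict)
    for line in lines:
        name, revnum, kind, key = line.rstrip("\n").split(" ", 3)
        # Assumed key layout: "<cvs revision>/<path of the RCS file>".
        _cvs_rev, path = key.split("/", 1)
        entry = ranges[name].setdefault(path, [None, None])
        # "O" fills the opening slot, "C" the closing slot.
        entry[0 if kind == "O" else 1] = int(revnum)
    return dict(ranges)


ranges = copy_ranges([
    "MY_TAG1 234 O 1.3/foo/bar/baz.txt,v",
    "MY_BRANCH3 245 O 1.13/foo/qux/bat.c,v",
    "MY_TAG1 241 C 1.4/foo/bar/baz.txt,v",
    "MY_BRANCH_BLAH 201 O 1.1/foo/bar/quux.txt,v",
])
```

A closing of None means the opening never got a matching closing, which
the notes say is a normal case.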

Pass 6:
=======

This pass merely sorts cvs2svn-symbolic-names.txt into
cvs2svn-symbolic-names-s.txt.  This orders the file first by symbolic
name, and second by Subversion revision number, thus grouping all
openings and closings for each symbolic name together.
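
The ordering this pass wants can be sketched as below.  The function
name is invented for the example, and comparing the revision number as
an integer is an assumption; how the real pass makes the numbers sort
correctly is not stated in these notes.

```python
def sort_symbolic_names(lines):
    """Order lines by symbolic name, then by Subversion revision number."""
    def sort_key(line):
        name, revnum, _kind, _unique_key = line.split(" ", 3)
        # Compare revision numbers numerically, so that 99 sorts
        # before 100.
        return (name, int(revnum))
    return sorted(lines, key=sort_key)


sorted_lines = sort_symbolic_names([
    "MY_TAG1 234 O 1.3/foo/bar/baz.txt,v",
    "MY_BRANCH3 245 O 1.13/foo/qux/bat.c,v",
    "MY_TAG1 241 C 1.4/foo/bar/baz.txt,v",
    "MY_BRANCH_BLAH 201 O 1.1/foo/bar/quux.txt,v",
])
```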

Pass 7:
=======

This pass iterates through all the lines in
cvs2svn-symbolic-names-s.txt, writing out a database file mapping
SYMBOLIC_NAME to the file offset in SYMBOL_OPENINGS_CLOSINGS_SORTED
where SYMBOLIC_NAME is first encountered.  This will allow us to seek
to the various offsets in the file and sequentially read only the
openings and closings that we need.
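
The offset index could be built and used roughly like this.  It is a
sketch only: the function names are made up, an in-memory byte stream
stands in for the sorted file, and a plain dict stands in for the
database file.

```python
import io


def build_offset_index(sorted_file):
    """Map each symbolic name to the offset of its first line."""
    offsets = {}
    while True:
        position = sorted_file.tell()
        line = sorted_file.readline()
        if not line:
            break
        name = line.split(b" ", 1)[0].decode()
        # Only the first occurrence of each name is recorded.
        offsets.setdefault(name, position)
    return offsets


def read_symbol(sorted_file, offsets, name):
    """Seek to a symbol's first line and read only its own entries."""
    sorted_file.seek(offsets[name])
    entries = []
    for line in iter(sorted_file.readline, b""):
        fields = line.decode().rstrip("\n").split(" ", 3)
        if fields[0] != name:
            break  # the file is sorted, so this symbol's run has ended
        entries.append(fields)
    return entries


sample = (
    b"MY_BRANCH3 245 O 1.13/foo/qux/bat.c,v\n"
    b"MY_TAG1 234 O 1.3/foo/bar/baz.txt,v\n"
    b"MY_TAG1 241 C 1.4/foo/bar/baz.txt,v\n"
)
stream = io.BytesIO(sample)
offsets = build_offset_index(stream)
tag_entries = read_symbol(stream, offsets, "MY_TAG1")
```

Because the file was sorted by symbolic name in the previous pass, one
seek followed by a sequential read is enough to collect everything for
a given name.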

Pass 8:
=======

The 8th pass has very little "thinking" to do -- it basically opens
the svn-nums-to-cvs-revs.db and, starting with Subversion revision 2
(revision 1 creates /trunk, /tags, and /branches), sequentially plays
out all the commits to either a Subversion repository or to a
dumpfile.

In --dump-only mode, the result of this pass is a Subversion
repository dumpfile (suitable for input to 'svnadmin load').  The
dumpfile is the data's last static stage: last chance to check over
the data, run it through svndumpfilter, move the dumpfile to another
machine, etc.

However, when not in --dump-only mode, no full dumpfile is created for
subsequent load into a Subversion repository.  Instead, miniature
dumpfiles, each representing a single revision, are created, loaded
into the repository, and then removed.

In both modes, the dumpfile revisions are created by walking through
cvs2svn-data.s-revs.

                  ===============================
                      Branches and Tags Plan.
                  ===============================

This pass is also where tag and branch creation is done.  Since
subversion does tags and branches by copying from existing revisions
(then maybe editing the copy, making subcopies underneath, etc), the
big question for cvs2svn is how to achieve the minimum number of
operations per creation.  For example, if it's possible to get the
right tag by just copying revision 53, then it's better to do that
than, say, copying revision 51 and then sub-copying in bits of
revision 52 and 53.

Also, since CVS does not version symbolic names, there is the
secondary question of *when* to create a particular tag or branch.
For example, a tag might have been made at any time after the youngest
commit included in it, or might even have been made piecemeal; and the
same is true for a branch, with the added constraint that for any
particular file, the branch must have been created before the first
commit on the branch.

Answering the second question first: cvs2svn creates tags as soon as
possible and branches as late as possible.

Tags are created as soon as cvs2svn encounters the last CVS Revision that
is a source for that tag.  The whole tag is created in one Subversion
commit.

For branches, this is "just in time" creation -- the moment it sees
the first commit on a branch, it snaps the entire branch into
existence (or as much of it as possible), and then outputs the branch
commit.

The reason we say "as much of it as possible" is that it's possible to
have a branch where some files have branch commits occurring earlier
than the other files even have the source revisions from which the
branch sprouts (this can happen if the branch was created piecemeal,
for example).  In this case, we create as much of the branch as we
can, that is, as much of it as there are source revisions available to
copy, and leave the rest for later.  "Later" might mean just until
other branch commits come in, or else during a cleanup stage that
happens at the end of this pass (about which more later).

How just-in-time branch creation works:

In order to make the "best" set of copies/deletes when creating a
branch, cvs2svn keeps track of two sets of trees while it's making
commits:

   1. A skeleton mirror of the subversion repository, that is, an
      array of revisions, with a tree hanging off each revision.  (The
      "array" is actually implemented as an anydbm database itself,
      mapping string representations of numbers to root keys.)

   2. A tree for each CVS symbolic name, and the svn file/directory
      revisions from which various parts of that tree could be copied.

Both tree sets live in anydbm databases, using the same basic schema:
unique keys map to marshal.dumps() representations of dictionaries,
which in turn map entry names to other unique keys:

   root_key  ==> { entryname1 : entrykey1, entryname2 : entrykey2, ... }
   entrykey1 ==> { entrynameX : entrykeyX, ... }
   entrykey2 ==> { entrynameY : entrykeyY, ... }
   entrykeyX ==> { etc, etc ...}
   entrykeyY ==> { etc, etc ...}

(The leaf nodes -- files -- are also dictionaries, for simplicity.)
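
The schema above can be tried out with a small sketch like the one
below.  It is not the cvs2svn mirror code: the class and method names
are invented, a plain dict stands in for the anydbm database, and nodes
are updated in place, whereas a real per-revision mirror would need to
keep older revisions' trees intact.

```python
import marshal


class TreeStore:
    """Unique keys mapping to marshalled {entry name: entry key} dicts."""

    def __init__(self):
        self.db = {}        # stand-in for the anydbm database
        self.next_key = 0

    def new_node(self):
        key = str(self.next_key)
        self.next_key += 1
        self.db[key] = marshal.dumps({})
        return key

    def entries(self, key):
        return marshal.loads(self.db[key])

    def add_path(self, root_key, path):
        """Create any missing nodes along path; return the leaf's key."""
        key = root_key
        for name in path.split("/"):
            children = self.entries(key)
            if name not in children:
                children[name] = self.new_node()
                self.db[key] = marshal.dumps(children)
            key = children[name]
        return key

    def path_exists(self, root_key, path):
        key = root_key
        for name in path.split("/"):
            children = self.entries(key)
            if name not in children:
                return False
            key = children[name]
        return True


store = TreeStore()
root = store.new_node()
leaf = store.add_path(root, "trunk/foo/bar.c")
```

As in the notes, the leaf for bar.c is itself just an empty dictionary,
so files and directories are handled by the same code.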

The repository mirror allows cvs2svn to remember what paths exist in
what revisions.

For details on how branches and tags are created, please see the
docstring of the SymbolingsLogger class (and its methods).

-*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*-
- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -
-*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*-

Some older notes and ideas about cvs2svn.  Not deleted, because they
may contain suggestions for future improvements in design.

-----------------------------------------------------------------------

An email from John Gardiner Myers <jgmyers@speakeasy.net> about some
considerations for the tool.

------
From: John Gardiner Myers <jgmyers@speakeasy.net>
Subject: Thoughts on CVS to SVN conversion
To: gstein@lyra.org
Date: Sun, 15 Apr 2001 17:47:10 -0700

Some things you may want to consider for a CVS to SVN conversion utility:

If converting a CVS repository to SVN takes days, it would be good for
the conversion utility to keep its progress state on disk.  If the
conversion fails halfway through due to a network outage or power
failure, that would allow the conversion to be resumed where it left off
instead of having to start over from an empty SVN repository.

It is a short step from there to allowing periodic updates of a
read-only SVN repository from a read/write CVS repository.  This allows
the more relaxed conversion procedure:

1) Create SVN repository writable only by the conversion tool.
2) Update SVN repository from CVS repository.
3) Announce the time of CVS to SVN cutover.
4) Repeat step (2) as needed.
5) Disable commits to CVS repository, making it read-only.
6) Repeat step (2).
7) Enable commits to SVN repository.
8) Wait for developers to move their workspaces to SVN.
9) Decommission the CVS repository.

You may forward this message or parts of it as you see fit.
------

-----------------------------------------------------------------------

Further design thoughts from Greg Stein <gstein@lyra.org>

* timestamp the beginning of the process. ignore any commits that
  occur after that timestamp; otherwise, you could miss portions of a
  commit (e.g. scan A; commit occurs to A and B; scan B; create SVN
  revision for items in B; we missed A)

* the above timestamp can also be used for John's "grab any updates
  that were missed in the previous pass."

* for each file processed, watch out for simultaneous commits. this
  may cause a problem during the reading/scanning/parsing of the file,
  or the parse succeeds but the results are garbage. this could be
  fixed with a CVS lock, but I'd prefer read-only access.

  algorithm: get the mtime before opening the file. if an error occurs
  during reading, and the mtime has changed, then restart the file. if
  the read is successful, but the mtime changed, then restart the
  file.

* use a separate log to track unique branches and non-branched forks
  of revision history (Q: is it possible to create, say, 1.4.1.3
  without a "real" branch?). this log can then be used to create a
  /branches/ directory in the SVN repository.

  Note: we want to determine some way to coalesce branches across
  files. It can't be based on name, though, since the same branch name
  could be used in multiple places, yet they are semantically
  different branches. Given files R, S, and T with branch B, we can
  tie those files' branch B into a "semantic group" whenever we see
  commit groups on a branch touching multiple files. Files that
  have a (named) branch but no commits on it are simply ignored. For
  each "semantic group" of a branch, we'd create a branch based on
  their common ancestor, then make the changes on the children as
  necessary. For single-file commits to a branch, we could use
  heuristics (pathname analysis) to add these to a group (and log what
  we did), or we could put them in a "reject" kind of file for a human
  to tell us what to do (the human would edit a config file of some
  kind to instruct the converter).

* if we have access to the CVSROOT/history, then we could process tags
  properly. otherwise, we can only use heuristics or configuration
  info to group up tags (branches can use commits; there are no
  commits associated with tags)

* ideally, we store every bit of data from the ,v files to enable a
  complete restoration of the CVS repository. this could be done by
  storing properties with CVS revision numbers and stuff (i.e. all
  metadata not already embodied by SVN would go into properties)

* how do we track the "states"? I presume "dead" is simply deleting
  the entry from SVN. what are the other legal states, and do we need
  to do anything with them?

* where do we put the "description"? how about locks, access list,
  keyword flags, etc.

* note that using something like the SourceForge repository will be an
  ideal test case. people *move* their repositories there, which means
  that all kinds of stuff can be found in those repositories, from
  wherever people used to run them, and under whatever development
  policies may have been used.

  For example: I found one of the projects with a "permissions 644;"
  line in the "gnuplot" repository.  Most RCS releases issue warnings
  about that (although they properly handle/skip the lines), and CVS
  ignores RCS newphrases altogether.
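
The mtime check suggested in the notes above for guarding against
simultaneous commits could be sketched as follows.  This is only an
illustration of that idea: the function name and the retry limit are
assumptions, and the "parse" step is reduced to reading the file's
bytes.

```python
import os
import tempfile


def read_stable(path, max_attempts=5):
    """Read a file, retrying if its mtime changes while we read it."""
    for _ in range(max_attempts):
        before = os.stat(path).st_mtime_ns
        try:
            with open(path, "rb") as handle:
                data = handle.read()
        except OSError:
            # An error while reading only justifies a restart if the
            # file was modified underneath us.
            if os.stat(path).st_mtime_ns != before:
                continue
            raise
        if os.stat(path).st_mtime_ns == before:
            return data
        # Read succeeded but the file changed meanwhile: start over.
    raise RuntimeError("file kept changing: %s" % path)


# Usage with a throwaway file standing in for an RCS ,v file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"head 1.6;\n")
data = read_stable(tmp.name)
os.unlink(tmp.name)
```

This needs only read access to the repository, which matches the stated
preference for avoiding a CVS lock.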


^ permalink raw reply

* Re: git-rebase nukes multiline comments
From: David Kowis @ 2006-06-16 17:55 UTC (permalink / raw)
  To: David Kowis; +Cc: git, mhopf
In-Reply-To: <4492E8F9.4000106@shlrm.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

David Kowis wrote:
<snip>
> 
> I'm new to git, but I tried what you said.
> my git log:
> commit c846bea8c61bec7cf0f7688c48abc42577b9ac7f
> Author: David Kowis <dkowis@kain.org>
> Date:   Fri Jun 16 12:20:08 2006 -0500
> 
>     this is a multi
> 
>     line comment
>     with three lines
> 
> 
> I'm using git 1.4.0. It added a blank line in there...

I'm going to note that the xorg ML cc doesn't work for anyone not
subscribed... You may miss out on replies.

- --
David Kowis

ISO Team Lead - www.sourcemage.org
Source Mage GNU/Linux

Progress isn't made by early risers. It's made by lazy men trying to
find easier ways to do something.
  - Robert Heinlein
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (MingW32)

iQGVAwUBRJLwn8nf+vRw63ObAQqhmwv7BXLqVSJa2FV6RVhLnmARqh+MHBAX+XLu
zgg/kcYd97pXz9bUEFEmY9tp3afzghA6EQlrV/zRHe/R/e1ZFjvTE27mUe3CvtHu
dUPgx6b85vMLkT2k6jbZ5BoA9KtbNITQlZnQJcEAMBv7aUrclRFykABnXwfh3YxM
jVOGbqoNaKzeB5/Sccb27xnzU91UjztB5X7yNgJYosO6tTz164bQQQbIMGWGztPw
wTwQOPK2+v4oUqfvYbKlX/Fd/Fve6PPWOAj5cUjxPHf47oiF/HY3ir/V/k04qO34
KFKAr10ss/sVm7kbURyj7AWJ/putgy9zzYzSWjqh+4ahTwIFb2ciPsU64o1MsO1K
Mnwz0IowmUUZO57qV0gkYdZyPvudOpV2v52aqMEhMyq8GU56Fvsy0KJma235Sv0r
D0ucIrrorCG0FyY7wKpEM83GJDBaTzxb/Mv8bjCD9/av1uQMjmMvqcPFWsZL+nRx
igTF8LiWzBBEG5b+PPjKlS8uofj8cW5g
=90TM
-----END PGP SIGNATURE-----

^ permalink raw reply

* [PATCH 1/4] git-svn: bugfix and optimize the 'log' command
From: Eric Wong @ 2006-06-16 17:57 UTC (permalink / raw)
  To: Junio C Hamano, git; +Cc: Eric Wong
In-Reply-To: <11504806463470-git-send-email-normalperson@yhbt.net>

Revisions with long commit messages were being skipped, since
the 'git-svn-id' metadata line was at the end and git-log uses a
32k buffer to print the commits.

Also the last 'git-svn-id' metadata line in a commit is always
the valid one, so make sure we use that, as well.

Made the verbose flag work by passing the correct option switch
('--summary') to git-log.

Finally, optimize -r/--revision argument handling by passing
the appropriate limits to revision

Signed-off-by: Eric Wong <normalperson@yhbt.net>
---
 contrib/git-svn/git-svn.perl |   60 ++++++++++++++++++++++++++++++++++++------
 1 files changed, 52 insertions(+), 8 deletions(-)

diff --git a/contrib/git-svn/git-svn.perl b/contrib/git-svn/git-svn.perl
index 149149f..417fcf1 100755
--- a/contrib/git-svn/git-svn.perl
+++ b/contrib/git-svn/git-svn.perl
@@ -663,17 +663,15 @@ sub show_log {
 	my $pid = open(my $log,'-|');
 	defined $pid or croak $!;
 	if (!$pid) {
-		my @rl = (qw/git-log --abbrev-commit --pretty=raw
-				--default/, "remotes/$GIT_SVN");
-		push @rl, '--raw' if $_verbose;
-		exec(@rl, @args) or croak $!;
+		exec(git_svn_log_cmd($r_min,$r_max), @args) or croak $!;
 	}
 	setup_pager();
 	my (@k, $c, $d);
+
 	while (<$log>) {
 		if (/^commit ($sha1_short)/o) {
 			my $cmt = $1;
-			if ($c && defined $c->{r} && $c->{r} != $r_last) {
+			if ($c && cmt_showable($c) && $c->{r} != $r_last) {
 				$r_last = $c->{r};
 				process_commit($c, $r_min, $r_max, \@k) or
 								goto out;
@@ -692,8 +690,7 @@ sub show_log {
 		} elsif ($d) {
 			push @{$c->{diff}}, $_;
 		} elsif (/^    (git-svn-id:.+)$/) {
-			my ($url, $rev, $uuid) = extract_metadata($1);
-			$c->{r} = $rev;
+			(undef, $c->{r}, undef) = extract_metadata($1);
 		} elsif (s/^    //) {
 			push @{$c->{l}}, $_;
 		}
@@ -715,6 +712,52 @@ out:
 
 ########################### utility functions #########################
 
+sub cmt_showable {
+	my ($c) = @_;
+	return 1 if defined $c->{r};
+	if ($c->{l} && $c->{l}->[-1] eq "...\n" &&
+				$c->{a_raw} =~ /\@([a-f\d\-]+)>$/) {
+		my @msg = safe_qx(qw/git-cat-file commit/, $c->{c});
+		shift @msg while ($msg[0] ne "\n");
+		shift @msg;
+		@{$c->{l}} = grep !/^git-svn-id: /, @msg;
+
+		(undef, $c->{r}, undef) = extract_metadata(
+				(grep(/^git-svn-id: /, @msg))[-1]);
+	}
+	return defined $c->{r};
+}
+
+sub git_svn_log_cmd {
+	my ($r_min, $r_max) = @_;
+	my @cmd = (qw/git-log --abbrev-commit --pretty=raw
+			--default/, "refs/remotes/$GIT_SVN");
+	push @cmd, '--summary' if $_verbose;
+	return @cmd unless defined $r_max;
+	if ($r_max == $r_min) {
+		push @cmd, '--max-count=1';
+		if (my $c = revdb_get($REVDB, $r_max)) {
+			push @cmd, $c;
+		}
+	} else {
+		my ($c_min, $c_max);
+		$c_max = revdb_get($REVDB, $r_max);
+		$c_min = revdb_get($REVDB, $r_min);
+		if ($c_min && $c_max) {
+			if ($r_max > $r_min) {
+				push @cmd, "$c_min..$c_max";
+			} else {
+				push @cmd, "$c_max..$c_min";
+			}
+		} elsif ($r_max > $r_min) {
+			push @cmd, $c_max;
+		} else {
+			push @cmd, $c_min;
+		}
+	}
+	return @cmd;
+}
+
 sub fetch_child_id {
 	my $id = shift;
 	print "Fetching $id\n";
@@ -2206,6 +2249,7 @@ sub setup_pager { # translated to Perl f
 sub get_author_info {
 	my ($dest, $author, $t, $tz) = @_;
 	$author =~ s/(?:^\s*|\s*$)//g;
+	$dest->{a_raw} = $author;
 	my $_a;
 	if ($_authors) {
 		$_a = $rusers{$author} || undef;
@@ -2440,7 +2484,7 @@ sub svn_grab_base_rev {
 	close $fh;
 	if (defined $c && length $c) {
 		my ($url, $rev, $uuid) = extract_metadata((grep(/^git-svn-id: /,
-			safe_qx(qw/git-cat-file commit/, $c)))[0]);
+			safe_qx(qw/git-cat-file commit/, $c)))[-1]);
 		return ($rev, $c);
 	}
 	return (undef, undef);
-- 
1.4.0

^ permalink raw reply related

* [PATCH 0/4] git-svn: more improvements
From: Eric Wong @ 2006-06-16 17:57 UTC (permalink / raw)
  To: Junio C Hamano, git
In-Reply-To: <11504049343825-git-send-email-normalperson@yhbt.net>

Another round of patches to git-svn, all depending on previous patches.

I've also setup a pullable repo here for all my latest git-svn stuff:
	git://git.bogomips.org/git-svn.git
	(http:// also works)

-- 
Eric Wong

^ permalink raw reply

