* packs and trees
@ 2006-06-20 5:57 Jon Smirl
2006-06-20 6:13 ` Martin Langhoff
0 siblings, 1 reply; 10+ messages in thread
From: Jon Smirl @ 2006-06-20 5:57 UTC (permalink / raw)
To: git
Converting from CVS would be a lot more efficient if all of the
revisions contained in a CVS file were written into git at the same
time. So, if I extract complete revisions from 100 source files into
git objects and then ask git to do an incremental pack, will git find
all of the deltas and do a good job packing? Some of these files have
thousands of deltas (50MB files). Also, note that I have not written
any tree info into git yet.
After all of the revisions are in git, I will follow up with the tree
info and then repack everything. How will the pack end up grouped:
chronologically, or will it still be sorted by file? It is not clear
to me how the tree info interacts with the magic packing sauce.
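A rough sketch of the plumbing sequence in question, assuming each CVS
revision has already been extracted to a plain file (the paths and the
loop are illustrative only):

    # Illustrative only: write every extracted revision of one CVS file
    # as a loose blob -- no tree or commit objects exist yet.
    for rev in /tmp/extracted/foo.c/*; do
        git hash-object -w "$rev"        # prints the blob's SHA-1
    done

    # Incremental repack of whatever loose objects now exist; whether the
    # delta search does well without any trees is the open question.
    git repack -d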
The plan is to modify rcs2git from parsecvs to create all of the git
objects for the tree. It would be called by the cvs2svn code which
would track the object IDs through the changeset generation process.
At the end it will write all of the trees connecting the objects
together.
cvs2svn seems to do a good job at generating the trees. I am not
exactly sure how the changeset detection algorithms in the three apps
compare, but cvs2svn is not having any trouble building changesets for
Mozilla. The other two apps have some issues, cvsps throws away some
of the branches and parsecvs can't complete the analysis.
--
Jon Smirl
jonsmirl@gmail.com
* Re: packs and trees
2006-06-20 5:57 packs and trees Jon Smirl
@ 2006-06-20 6:13 ` Martin Langhoff
2006-06-20 14:35 ` Jon Smirl
2006-06-20 15:03 ` Nicolas Pitre
0 siblings, 2 replies; 10+ messages in thread
From: Martin Langhoff @ 2006-06-20 6:13 UTC (permalink / raw)
To: Jon Smirl; +Cc: git
On 6/20/06, Jon Smirl <jonsmirl@gmail.com> wrote:
> The plan is to modify rcs2git from parsecvs to create all of the git
> objects for the tree.
Sounds like a good plan. Have you seen the recent discussions about it
being impossible to repack usefully when you don't have trees (and the
resulting performance problems on ext3)?
> cvs2svn seems to do a good job at generating the trees.
No doubt. Gut the last stage, and use all the data in the intermediate
DBs to run a git import. It's a great plan, and if you can understand
that Python code... all yours ;-)
> exactly sure how the changeset detection algorithms in the three apps
> compare, but cvs2svn is not having any trouble building changesets for
> Mozilla. The other two apps have some issues, cvsps throws away some
> of the branches and parsecvs can't complete the analysis.
Have you tried a recent parsecvs from Keith's tree? There's been quite
a bit of activity there too. And Keith's interested in sorting out
incremental imports too, which you need for a reasonable Moz
transition plan as well.
cheers,
martin
* Re: packs and trees
2006-06-20 6:13 ` Martin Langhoff
@ 2006-06-20 14:35 ` Jon Smirl
2006-06-20 15:18 ` Keith Packard
2006-06-20 15:03 ` Nicolas Pitre
1 sibling, 1 reply; 10+ messages in thread
From: Jon Smirl @ 2006-06-20 14:35 UTC (permalink / raw)
To: Martin Langhoff; +Cc: git
On 6/20/06, Martin Langhoff <martin.langhoff@gmail.com> wrote:
> On 6/20/06, Jon Smirl <jonsmirl@gmail.com> wrote:
> > The plan is to modify rcs2git from parsecvs to create all of the git
> > objects for the tree.
>
> Sounds like a good plan. Have you seen recent discussions about it
> being impossible to repack usefully when you don't have trees (and
> resulting performance problems on ext3).
No, I will look back in the archives. If needed we can do a repack
after each file is added. I would hope that git can handle a repack
when the new stuff is 100% deltas from a single file.
If I can't pack, the exploded deltas need about 35GB of disk space.
That is an awful lot to feed to a pack all at once, but at that point
it will at least have trees.
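Roughly the per-file loop this suggests; the extraction helper here is
hypothetical, and "git repack -d" only rolls the new loose objects into
a pack while keeping the existing packs:

    # Hypothetical driver: import one ,v file at a time, then fold the
    # new loose blobs into a pack so the object store never explodes.
    find /cvsroot/mozilla -name '*,v' | while read f; do
        extract_and_write_blobs "$f"   # hypothetical: git hash-object -w per revision
        git repack -d                  # pack loose objects, keep existing packs
    done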
>
> > cvs2svn seems to do a good job at generating the trees.
>
> No doubt. Gut the last stage, and use all the data in the intermediate
> DBs to run a git import. It's a great plan, and if you can understand
> that Python code... all yours ;-)
How hard would it be to adjust cvsps to use cvs2svn's algorithm for
grouping the changesets? I'd rather do this in a C app but I haven't
figured out the guts of parsecvs or cvsps well enough to change the
algorithms. There is no requirement to use external databases; sorting
everything in RAM is fine.
If you are interested in changing the cvsps grouping algorithm, I can
look at modifying it to write out the revisions as they are parsed.
Then you would only need to save the git sha1 in memory instead of the
file:rev when sorting.
> > exactly sure how the changeset detection algorithms in the three apps
> > compare, but cvs2svn is not having any trouble building changesets for
> > Mozilla. The other two apps have some issues, cvsps throws away some
> > of the branches and parsecvs can't complete the analysis.
>
> Have you tried a recent parsecvs from Keith's tree? There's been quite
> a bit of activity there too. And Keith's interested in sorting out
> incremental imports too, which you need for a reasonable Moz
> transition plan as well.
Keith's parsecvs run ended up in a loop, and mine hit a parsecvs error
and then had memory corruption after about eight hours. That was last
week; I just checked the logs and I don't see any comments about
fixing it.
Even after spending eight hours building the changeset info, it is
still going to take a couple of days to retrieve the versions one at a
time and write them to git. Reparsing 50MB delta files n^2/2 times is
a major bottleneck for all three programs.
--
Jon Smirl
jonsmirl@gmail.com
* Re: packs and trees
2006-06-20 6:13 ` Martin Langhoff
2006-06-20 14:35 ` Jon Smirl
@ 2006-06-20 15:03 ` Nicolas Pitre
2006-06-20 19:41 ` Martin Langhoff
1 sibling, 1 reply; 10+ messages in thread
From: Nicolas Pitre @ 2006-06-20 15:03 UTC (permalink / raw)
To: Martin Langhoff; +Cc: Jon Smirl, git
On Tue, 20 Jun 2006, Martin Langhoff wrote:
> On 6/20/06, Jon Smirl <jonsmirl@gmail.com> wrote:
> > The plan is to modify rcs2git from parsecvs to create all of the git
> > objects for the tree.
>
> Sounds like a good plan. Have you seen recent discussions about it
> being impossible to repack usefully when you don't have trees (and
> resulting performance problems on ext3).
What do you mean?
Nicolas
* Re: packs and trees
2006-06-20 14:35 ` Jon Smirl
@ 2006-06-20 15:18 ` Keith Packard
2006-06-20 16:33 ` Jon Smirl
0 siblings, 1 reply; 10+ messages in thread
From: Keith Packard @ 2006-06-20 15:18 UTC (permalink / raw)
To: Jon Smirl; +Cc: keithp, Martin Langhoff, git
On Tue, 2006-06-20 at 10:35 -0400, Jon Smirl wrote:
> Keith's parsecvs run ended up in a loop and mine hit a parsecvs error
> and then had memory corruption after about eight hours. That was last
> week, I just checked the logs and I don't see any comments about
> fixing it.
Yeah, I'm rewriting the tool; the current codebase isn't supportable.
> Even after spending eight hours building the changeset info iit is
> still going to take it a couple of days to retrieve the versions one
> at a time and write them to git. Reparsing 50MB delta files n^2/2
> times is a major bottleneck for all three programs.
The eight hours in question *were* writing out the deltas and packing
the resulting trees. All that remained was to construct actual commit
objects and write them out.
The problem was that parsecvs's internals are structured so that this
process would take a large amount of memory, so I'm reworking the code
to free stuff as it goes along.
With a rewritten parsecvs, I'm hoping to be able to steal the algorithms
from cvs2svn and stick those in place. Then work on truncating the
history so it can deal with incremental updates to the repository, which
I think will be straightforward if we stick a few breadcrumbs in the git
repository to recover state from.
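One possible shape for such breadcrumbs, purely as a sketch (the state
file name and its format are made up here, not anything parsecvs does):

    # Sketch only: record, per ,v file, the last CVS revision that has
    # already been turned into git objects.
    printf 'src/foo.c,v 1.127\nsrc/bar.c,v 1.42\n' > .git/cvs-import-state

    # An incremental run would read this back and skip work already done:
    while read file rev; do
        echo "resume $file after revision $rev"
    done < .git/cvs-import-state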
--
keith.packard@intel.com
* Re: packs and trees
2006-06-20 15:18 ` Keith Packard
@ 2006-06-20 16:33 ` Jon Smirl
0 siblings, 0 replies; 10+ messages in thread
From: Jon Smirl @ 2006-06-20 16:33 UTC (permalink / raw)
To: Keith Packard; +Cc: Martin Langhoff, git
On 6/20/06, Keith Packard <keithp@keithp.com> wrote:
> > Even after spending eight hours building the changeset info iit is
> > still going to take it a couple of days to retrieve the versions one
> > at a time and write them to git. Reparsing 50MB delta files n^2/2
> > times is a major bottleneck for all three programs.
>
> The eight hours in question *were* writing out the deltas and packing
> the resulting trees. All that remained was to construct actual commit
> objects and write them out.
>
> The problem was that parsecvs's internals are structured so that this
> processes would take a large amount of memory, so I'm reworking the code
> to free stuff as it goes along.
How about writing out all of the revisions from the CVS file using the
yacc code the first time the file is encountered and parsed? Then you
only have to track git IDs and not all of those cumbersome CVS rev
numbers. When I was profiling parsecvs, the hottest parts of the code
were extracting the revisions and comparing CVS rev numbers. Since the
git IDs are fixed size they work well in arrays and with pointer
compares for sorting. With the right data structure you should be able
to eliminate the CVS rev numbers that are so slow to deal with.
There are about 1M revisions in Mozilla CVS. At eight bytes for an ID
and eight bytes for a timestamp that is 16MB if ordering is achieved via
arrays. All of the symbols fit into 400K including pointers to their
revision. If the revs are written out as they are encountered there is
no need to save file names, but you do need one rev structure per
file. Throw in some more memory for relationship pointers. All of this
should fit into less than 100MB RAM.
>
> With a rewritten parsecvs, I'm hoping to be able to steal the algorithms
> from cvs2svn and stick those in place. Then work on truncating the
> history so it can deal with incremental updates to the repository, which
> I think will be straightforward if we stick a few breadcrumbs in the git
> repository to recover state from.
--
Jon Smirl
jonsmirl@gmail.com
* Re: packs and trees
2006-06-20 15:03 ` Nicolas Pitre
@ 2006-06-20 19:41 ` Martin Langhoff
2006-06-20 20:51 ` Nicolas Pitre
2006-06-21 3:54 ` Linus Torvalds
0 siblings, 2 replies; 10+ messages in thread
From: Martin Langhoff @ 2006-06-20 19:41 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Jon Smirl, git
On 6/21/06, Nicolas Pitre <nico@cam.org> wrote:
> On Tue, 20 Jun 2006, Martin Langhoff wrote:
>
> > On 6/20/06, Jon Smirl <jonsmirl@gmail.com> wrote:
> > > The plan is to modify rcs2git from parsecvs to create all of the git
> > > objects for the tree.
> >
> > Sounds like a good plan. Have you seen recent discussions about it
> > being impossible to repack usefully when you don't have trees (and
> > resulting performance problems on ext3).
>
> What do you mean?
I was thinking of the "repacking disconnected objects" thread, but now
I see it did have a solution in listing all the objects and paths. I
take that back.
If you are asking about the ext3 performance problems, I think Linus
discussed a while ago why unpacked repos are slow (as well as huge),
and there were some suggestions of using hashed directory indexes.
cheers,
martin
* Re: packs and trees
2006-06-20 19:41 ` Martin Langhoff
@ 2006-06-20 20:51 ` Nicolas Pitre
2006-06-21 3:54 ` Linus Torvalds
1 sibling, 0 replies; 10+ messages in thread
From: Nicolas Pitre @ 2006-06-20 20:51 UTC (permalink / raw)
To: Martin Langhoff; +Cc: Jon Smirl, git
On Wed, 21 Jun 2006, Martin Langhoff wrote:
> On 6/21/06, Nicolas Pitre <nico@cam.org> wrote:
> > On Tue, 20 Jun 2006, Martin Langhoff wrote:
> >
> > > On 6/20/06, Jon Smirl <jonsmirl@gmail.com> wrote:
> > > > The plan is to modify rcs2git from parsecvs to create all of the git
> > > > objects for the tree.
> > >
> > > Sounds like a good plan. Have you seen recent discussions about it
> > > being impossible to repack usefully when you don't have trees (and
> > > resulting performance problems on ext3).
> >
> > What do you mean?
>
> I was thinking of the "repacking disconnected objects" thread, but now
> I see it did have a solution in listing all the objects and paths. I
> take that back.
OK.
Nicolas
* Re: packs and trees
2006-06-20 19:41 ` Martin Langhoff
2006-06-20 20:51 ` Nicolas Pitre
@ 2006-06-21 3:54 ` Linus Torvalds
2006-06-21 15:32 ` David Lang
1 sibling, 1 reply; 10+ messages in thread
From: Linus Torvalds @ 2006-06-21 3:54 UTC (permalink / raw)
To: Martin Langhoff; +Cc: Nicolas Pitre, Jon Smirl, git
On Wed, 21 Jun 2006, Martin Langhoff wrote:
>
> If you are asking about the ext3 performance problems, I think Linus
> discussed that a while ago, why unpacked repos are slow (in addition
> to huge), and there were some suggestions of using hashed directory
> indexes.
Yes. I think most distros still default to nonhashed directories, but for
any large-directory case you really want to turn on hashing.
I forget the exact details, it's something like
tune2fs -O dir_index
or something to turn it on (if I remember correctly, that will only
affect directories created after that point, but you can get around it
by doing a "git repack -a -d", which will remove all the old object
directories, so subsequent directories will be created with indexing
on).
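Spelled out, with the device name purely as an example (double-check
against your own setup before touching a live filesystem):

    # Enable hashed directory indexes on an ext3 filesystem.
    tune2fs -O dir_index /dev/hda1

    # Only directories created afterwards get the index, so repack as
    # suggested above to make git recreate its object directories.
    git repack -a -d

    # (e2fsck -fD on the unmounted filesystem is supposed to rebuild the
    # indexes for existing directories.)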
Personally, I just ended up using packs extensively, so I think I'm still
running without indexing on all my machines ;)
Linus
* Re: packs and trees
2006-06-21 3:54 ` Linus Torvalds
@ 2006-06-21 15:32 ` David Lang
0 siblings, 0 replies; 10+ messages in thread
From: David Lang @ 2006-06-21 15:32 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Martin Langhoff, Nicolas Pitre, Jon Smirl, git
There are performance penalties in some cases when you use directory
hashing. I did some tests last year with lots of small files in a
directory tree (X/X/X/XXX of 1k files) and found that turning on
directory hashing actually slowed down the creation of this tree from
a tarball significantly (I also found that in this case the ext2
allocation strategy was significantly better than the ext3 one), so
test on your system.
David Lang
On Tue, 20 Jun 2006, Linus Torvalds wrote:
> On Wed, 21 Jun 2006, Martin Langhoff wrote:
>>
>> If you are asking about the ext3 performance problems, I think Linus
>> discussed that a while ago, why unpacked repos are slow (in addition
>> to huge), and there were some suggestions of using hashed directory
>> indexes.
>
> Yes. I think most distros still default to nonhashed directories, but for
> any large-directory case you really want to turn on hashing.
>
> I forget the exact details, it's somethng like
>
> tune2fs -O dir_index
>
> or something to turn it on (if I remember correctly, that will only affect
> any directories then created after that, but you can effect that by just
> doing a "git repack -a -d" which will remove all old object directories,
> and now subsequent directories will be done with indexing on).
>
> Personally, I just ended up using packs extensively, so I think I'm still
> running without indexing on all my machines ;)
>
> Linus
Thread overview: 10 messages
2006-06-20 5:57 packs and trees Jon Smirl
2006-06-20 6:13 ` Martin Langhoff
2006-06-20 14:35 ` Jon Smirl
2006-06-20 15:18 ` Keith Packard
2006-06-20 16:33 ` Jon Smirl
2006-06-20 15:03 ` Nicolas Pitre
2006-06-20 19:41 ` Martin Langhoff
2006-06-20 20:51 ` Nicolas Pitre
2006-06-21 3:54 ` Linus Torvalds
2006-06-21 15:32 ` David Lang