git.vger.kernel.org archive mirror
* Idea for git-fast-import
From: Michael Haggerty @ 2007-07-20  6:59 UTC
  To: git, spearce

I'm working on a git backend for cvs2svn and had an idea for
git-fast-import that would make life a tiny bit easier:

Currently, git-fast-import marks are positive integers.  But they are
used for two things: marking single-file blobs, and marking commits.

This is a tiny bit awkward, because cvs2svn assigns small integer IDs to
these things too, but uses distinct (overlapping) integer series for the
two concepts.  If it would be trivial to split the marks into two
"namespaces" (one for single-file blobs and one for commits), that would
make things a little bit more natural.  I don't think commit marks can
be used interchangeably with blob marks anyway, so it wouldn't be a
backwards incompatibility.

Without this feature, I will have to assign a new "mark" integer series
that is unrelated to cvs2svn's IDs, which is no big deal at all but will
make debugging a little bit harder.  So only add this feature if it is
really easy for you.
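
A minimal sketch of that fallback, assuming a front end that identifies
each object by a (kind, ID) pair. The names `MarkAllocator`, `kind` and
`source_id` are hypothetical and not part of cvs2svn or git-fast-import;
the only fact taken from above is that marks are positive integers
shared by blobs and commits.

```python
class MarkAllocator:
    """Hand out one consecutive mark series for (kind, source_id) pairs.

    `kind` and `source_id` are hypothetical front-end names used only
    for illustration.
    """

    def __init__(self):
        self._marks = {}     # (kind, source_id) -> mark
        self._next_mark = 1  # marks are positive integers

    def mark_for(self, kind, source_id):
        """Return the mark for this object, allocating one on first use."""
        key = (kind, source_id)
        if key not in self._marks:
            self._marks[key] = self._next_mark
            self._next_mark += 1
        return self._marks[key]


alloc = MarkAllocator()
blob_mark = alloc.mark_for("blob", 7)      # blob with front-end ID 7
commit_mark = alloc.mark_for("commit", 7)  # commit that reuses ID 7
```

Because the series is consecutive, the two overlapping ID series end up
in one dense mark range, at the price of marks that no longer match the
front-end IDs.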

Also, is there a big cost to using "not-quite-consecutive" integers as
marks?  cvs2svn's CVSRevision IDs are intermingled with IDs for
CVSBranches and CVSTags, so the CVSRevisions alone probably only pack
the ID space 5%-50% full.

In fact, if there is a big cost to "not-quite-consecutive" integers,
then I withdraw my request for separate mark namespaces, since I would
have to reallocate mark numbers anyway :-)

Another thing that might help with debugging would be a "comment"
command, which git-fast-import should ignore.  One could put text about
the source of a chunk of git-fast-import stream to relate it back to the
front-end concepts when debugging the stream contents by hand.
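
To make the idea concrete, here is a rough sketch of a front end
annotating a blob with its origin. The "#" comment prefix and the helper
name `blob_with_comment` are assumptions made for illustration; only the
blob / mark / data layout is the existing stream format.

```python
def blob_with_comment(mark, content, note):
    """Return stream bytes for one blob, preceded by a comment line.

    The '#' comment prefix is an assumption for illustration; `note`
    must fit on a single line.
    """
    if "\n" in note:
        raise ValueError("comment text must be a single line")
    header = [
        b"# " + note.encode("utf-8"),  # hypothetical comment line
        b"blob",
        b"mark :%d" % mark,
        b"data %d" % len(content),     # exact byte count of the content
    ]
    return b"\n".join(header) + b"\n" + content + b"\n"


chunk = blob_with_comment(1, b"hello\n", "CVSRevision 1234 of foo.c,v")
```

Someone reading the stream by hand would then see the front-end object
named directly above the blob it produced.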

[I will be out of town until Monday, so don't be surprised that I don't
respond right away :-) ]

Thanks,
Michael


* Re: Idea for git-fast-import
From: Shawn O. Pearce @ 2007-07-20  7:28 UTC
  To: Michael Haggerty; +Cc: git

Michael Haggerty <mhagger@alum.mit.edu> wrote:
> I'm working on a git backend for cvs2svn and had an idea for
> git-fast-import that would make life a tiny bit easier:

Cool!
 
> Currently, git-fast-import marks are positive integers.  But they are
> used for two things: marking single-file blobs, and marking commits.
> 
> This is a tiny bit awkward, because cvs2svn assigns small integer IDs to
> these things too, but uses distinct (overlapping) integer series for the
> two concepts.  If it would be trivial to split the marks into two
> "namespaces" (one for single-file blobs and one for commits), that would
> make things a little bit more natural.  I don't think commit marks can
> be used interchangeably with blob marks anyway, so it wouldn't be a
> backwards incompatibility.

That's true, they aren't interchangeable.  fast-import pukes
and dies if you try to use the wrong type at the wrong location.
It has been requested before that the two namespaces be split,
and I just have been too lazy to do it.

> Without this feature, I will have to assign a new "mark" integer series
> that is unrelated to cvs2svn's IDs, which is no big deal at all but will
> make debugging a little bit harder.  So only add this feature if it is
> really easy for you.

It's not that much code reorg, but there is some reorg required to
make it work.  Maybe only a few hundred line diff, so probably well
within reason.  I'll look into it later.
 
> Also, is there a big cost to using "not-quite-consecutive" integers as
> marks?  cvs2svn's CVSRevision IDs are intermingled with IDs for
> CVSBranches and CVSTags, so the CVSRevisions alone probably only pack
> the ID space 5%-50% full.

Marks cost exactly 1 pointer (4 or 8 bytes) as they are actually just
a pointer to the already-in-memory object metadata that fast-import
uses for bookkeeping related to packfile generation.  Gaps in the
marks sequence also cost exactly 1 pointer, as they are just NULL.

But the marks table is actually a sparse array, using 1024 entries
per block.  So if you assign a mark at :5, then another at say
:1047000 you have only allocated 3 blocks and 12 KiB of memory
(a root directory block at 4 KiB, two leaves at 4 KiB each).  A far
cry from 4 MiB.
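
As a rough illustration only (this is not fast-import's actual code),
the layout can be sketched as a two-level table.  The class and method
names are made up, 4-byte pointers are assumed so that a 1024-entry
block is 4 KiB, and the sketch stops at two levels rather than growing
more directory levels for larger marks.

```python
ENTRIES_PER_BLOCK = 1024  # entries per block, as described above
POINTER_BYTES = 4         # assumption: 32-bit pointers, so a block is 4 KiB


class SparseMarkTable:
    """Two-level sparse array: one root directory pointing at leaf blocks."""

    def __init__(self):
        self.root = [None] * ENTRIES_PER_BLOCK
        self.blocks = 1  # the root directory block itself

    def insert(self, mark, obj):
        leaf_index, slot = divmod(mark, ENTRIES_PER_BLOCK)
        if leaf_index >= ENTRIES_PER_BLOCK:
            raise ValueError("mark needs more than two levels in this sketch")
        if self.root[leaf_index] is None:
            # Leaf blocks are allocated lazily, on first use.
            self.root[leaf_index] = [None] * ENTRIES_PER_BLOCK
            self.blocks += 1
        self.root[leaf_index][slot] = obj

    def find(self, mark):
        leaf_index, slot = divmod(mark, ENTRIES_PER_BLOCK)
        if leaf_index >= ENTRIES_PER_BLOCK:
            return None
        leaf = self.root[leaf_index]
        return None if leaf is None else leaf[slot]

    def bytes_used(self):
        return self.blocks * ENTRIES_PER_BLOCK * POINTER_BYTES


table = SparseMarkTable()
table.insert(5, "blob-a")           # lands in leaf 0
table.insert(1047000, "commit-b")   # lands in leaf 1022
```

With those two marks `table.blocks` is 3 and `table.bytes_used()` is
12288 bytes (12 KiB), whereas a flat array indexed up to :1047000 would
need about 4 MiB at 4 bytes per entry.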

It's not a binary tree; it's a sparse digital index.  So going
really far out in the namespace with huge gaps will cost you some
index nodes.  Staying reasonably dense is actually quite efficient,
with pretty low directory overheads.

> In fact, if there is a big cost to "not-quite-consecutive" integers,
> then I withdraw my request for separate mark namespaces, since I would
> have to reallocate mark numbers anyway :-)

See above.  5% full is really bad, because you are probably going to
allocate nearly every block in the directory, and only fill each leaf
block at 5% full.  50% full is actually reasonable, as it means marks
are only costing you about 2 pointers on average (8 or 16 bytes).
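
The arithmetic behind those figures, as a small sketch.  It assumes the
marks are spread evenly enough that every leaf block gets allocated, and
it ignores the directory blocks; the function name is made up for
illustration.

```python
def pointers_per_mark(fill_fraction):
    """Average pointer slots paid per mark when every leaf block is
    allocated but only `fill_fraction` of its slots hold a mark."""
    if not 0 < fill_fraction <= 1:
        raise ValueError("fill_fraction must be in (0, 1]")
    return 1 / fill_fraction


half_full = pointers_per_mark(0.5)     # about 2 pointers per mark
sparse_fill = pointers_per_mark(0.05)  # about 20 pointers per mark
```

At 50% full that is the 2 pointers (8 or 16 bytes) mentioned above; at
5% full it is roughly 20 pointers, or about 80 to 160 bytes per mark.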

I went with the sparse array/digital index approach because it is
fairly compact code with quick store and lookup operations, and I figured
most frontends could get at least 50% full on their mark allocation.
On really dense allocations (>60%) the very low overhead per mark
makes it insanely efficient, even for a very large number of marks.

Jon Smirl was dumping marks sequentially from his hacked cvs2svn,
thereby getting the marks table at 100% full.  Other recent import
attempts with fast-import have also managed to keep their mark
allocations pretty close (if not dead on) at 100% full.

I can see how it might be convenient to have a very sparsely filled
mark namespace.  It's also convenient to have a mark namespace that
uses arbitrary strings.  Unfortunately I chose not to support
those very well (or at all!) for the sake of trying to keep the
fast-import code more compact internally, and to simplify its
internal memory management.  You might be able to talk me into
improving on that however.  ;-)
 
> Another thing that might help with debugging would be a "comment"
> command, which git-fast-import should ignore.  One could put text about
> the source of a chunk of git-fast-import stream to relate it back to the
> front-end concepts when debugging the stream contents by hand.

This is an awesome idea, especially when combined with having a
buffer of the last few commands that fast-import saw right before
it crashed.  I'll see what I can do.
 
-- 
Shawn.


* Re: Idea for git-fast-import
From: Michael Haggerty @ 2007-07-22 18:35 UTC
  To: Shawn O. Pearce; +Cc: git

Shawn O. Pearce wrote:
> Michael Haggerty <mhagger@alum.mit.edu> wrote:
>> Also, is there a big cost to using "not-quite-consecutive" integers as
>> marks?  cvs2svn's CVSRevision IDs are intermingled with IDs for
>> CVSBranches and CVSTags, so the CVSRevisions alone probably only pack
>> the ID space 5%-50% full.
> 
>> In fact, if there is a big cost to "not-quite-consecutive" integers,
>> then I withdraw my request for separate mark namespaces, since I would
>> have to reallocate mark numbers anyway :-)
> 
> See above.  5% full is really bad, because you are probably going to
> allocate nearly every block in the directory, and only fill each leaf
> block at 5% full.  50% full is actually reasonable, as it means marks
> are only costing you about 2 pointers on average (8 or 16 bytes).

OK, then, never mind.  As I mentioned, it is not a big deal to have
cvs2svn generate a separate integer series for marks.  If comments are
implemented, then the debugging disadvantage is also quite minor.

>> Another thing that might help with debugging would be a "comment"
>> command, which git-fast-import should ignore.  One could put text about
>> the source of a chunk of git-fast-import stream to relate it back to the
>> front-end concepts when debugging the stream contents by hand.
> 
> This is an awesome idea, especially when combined with having a
> buffer of the last few commands that fast-import saw right before
> it crashed.  I'll see what I can do.

Thanks!

Michael

